English Amiga Board


Old 21 June 2023, 16:42   #21
saimo
Registered User
 
Join Date: Aug 2010
Location: Italy
Posts: 787
I just noticed that 4 values for the CMYW model in the colors table were wrong: I fixed them and uploaded a new archive.
While at it, given the request above, I whipped up an AMOS Professional program that shows how to set up a PED81C screen and to perform some basic operations on it - hopefully, this will be easy to understand and also open the door to AMOS programmers. The program source is included in the archive.

Code:
'-----------------------------------------------------------------------------
'$VER: PED81C example 1.3 (28.11.2023) (c) 2023 RETREAM
'Legal terms: please refer to the accompanying documentation.
'www.retream.com/PED81C
'contact@retream.com
'-----------------------------------------------------------------------------

'-----------------------------------------------------------------------------
'DESCRIPTION
'This shows how to set up a PED81C screen and to perform some basic operations
'on it.
'Screen features:
' * equivalent to a 319x256 LORES screen
' * 160 dots wide raster
' * single buffer
' * blanked border
' * 64-bit bitplanes fetch mode
' * CMYW color model
'
'NOTES
'The code is written to be readable, not to be general-purpose/optimal.
'-----------------------------------------------------------------------------

'-----------------------------------------------------------------------------
'GLOBAL VARIABLES

Global RASTERADDRESS,RASTERWIDTH,RASTERHEIGHT,RASTERSIZE

RASTERWIDTH=160
RASTERHEIGHT=256
RASTERSIZE=RASTERWIDTH*RASTERHEIGHT

'-----------------------------------------------------------------------------
'MAIN

'Initialize everything.

_INITIALIZE_AMOS_ENVIRONMENT
_INITIALIZE_SCREEN

'If the initialization succeeded, load a picture into the raster and, in case
'of success, execute a simple effect on it.

If Param
   _LOAD_PICTURE_INTO_RASTER["picture-160x256.raw"]
   If Param
      _TURN_DISPLAY_DMA_ON[0]
      _RANDOMIZE_RASTER
      _TURN_DISPLAY_DMA_OFF
   End If
End If

'Deinitialize everything.

_DEINITIALIZE_SCREEN
_RESTORE_AMOS_ENVIRONMENT

'-----------------------------------------------------------------------------
'ROUTINES

Procedure _ALLOCATE_BITPLANE[BANKINDEX,SIZE]
   '--------------------------------------------------------------------------
   'DESCRIPTION
   'Allocates a CHIP RAM buffer to be used as a bitplane.
   '
   'INPUT
   'BANKINDEX = index of bank to use
   'SIZE      = size [bytes] of bitplane
   '
   'OUTPUT
   '64-bit-aligned bitplane address (0 = error)
   '
   'WARNINGS
   'The buffer must be freed with Erase BANKINDEX or Erase All.
   '--------------------------------------------------------------------------

   Trap Reserve As Chip Data BANKINDEX,SIZE+8
   If Errtrap=0 Then A=(Start(BANKINDEX)+7) and $FFFFFFF8

End Proc[A]
Procedure _DEINITIALIZE_SCREEN
   '--------------------------------------------------------------------------
   'DESCRIPTION
   'Deinitializes the screen.
   '
   'WARNINGS
   'Can be called only if the display is off.
   '--------------------------------------------------------------------------

   Erase All
   Doke $DFF1FC,0 : Rem FMODE

End Proc
Procedure _INITIALIZE_AMOS_ENVIRONMENT
   '--------------------------------------------------------------------------
   'DESCRIPTION
   'Ensures the program cannot be interrupted or brought to back, and turns
   'off the AMOS video system.
   '--------------------------------------------------------------------------

   Break Off
   Amos Lock
   Comp Test Off
   Auto View Off
   Update Off
   Copper Off
   _TURN_DISPLAY_DMA_OFF

End Proc
Procedure _INITIALIZE_SCREEN
   '--------------------------------------------------------------------------
   'DESCRIPTION
   'Initializes the screen.
   '
   'OUTPUT
   '-1/0 = OK/error
   '
   'WARNINGS
   '_DEINITIALIZE_SCREEN[] must be called also in case of failure.
   '
   'NOTES
   'Sets RASTERADDRESS.
   '--------------------------------------------------------------------------

   'Allocate the raster.

   _ALLOCATE_BITPLANE[10,RASTERSIZE] : If Param=0 Then Pop Proc[0]
   RASTERADDRESS=Param

   'Allocate and fill the selector bitplanes.

   _ALLOCATE_BITPLANE[11,RASTERSIZE] : If Param=0 Then Pop Proc[0]
   B3A=Param
   Fill B3A To B3A+RASTERSIZE,$55555555

   _ALLOCATE_BITPLANE[12,RASTERSIZE] : If Param=0 Then Pop Proc[0]
   B4A=Param
   Fill B4A To B4A+RASTERSIZE,$33333333

   'Set the chipset.

   DIWSTRTX=$81+(160-RASTERWIDTH)
   DIWSTRTY=$2C+(128-RASTERHEIGHT/2)
   DIWSTRT=((DIWSTRTY and $FF)*256) or((DIWSTRTX+1) and $FF)
   DIWSTOPX=DIWSTRTX+RASTERWIDTH*2
   DIWSTOPY=DIWSTRTY+RASTERHEIGHT
   DIWSTOP=((DIWSTOPY and $FF)*256) or(DIWSTOPX and $FF)
   DIWHIGH=((DIWSTOPX and $100)*32) or(DIWSTOPY and $700) or((DIWSTRTX and $100)/8) or(DIWSTRTY/256)
   DDFSTRT=(DIWSTRTX-17)/2
   DDFSTOP=DDFSTRT+RASTERWIDTH-8

   Doke $DFF092,DDFSTRT
   Doke $DFF094,DDFSTOP
   Doke $DFF08E,DIWSTRT
   Doke $DFF090,DIWSTOP
   Doke $DFF1E4,DIWHIGH

   Doke $DFF100,$4241 : Rem BPLCON0
   Doke $DFF102,$10 : Rem BPLCON1
   Doke $DFF104,$224 : Rem BPLCON2
   Doke $DFF108,0 : Rem BPLMOD1
   Doke $DFF10A,0 : Rem BPLMOD2
   Doke $DFF1FC,$3 : Rem FMODE

   'Set COLORxx.

   Doke $DFF106,$20 : Rem BPLCON3
   Doke $DFF180,0
   Doke $DFF182,$88
   Doke $DFF184,$88
   Doke $DFF186,$FF
   Doke $DFF188,0
   Doke $DFF18A,$808
   Doke $DFF18C,$808
   Doke $DFF18E,$F0F
   Doke $DFF190,0
   Doke $DFF192,$880
   Doke $DFF194,$880
   Doke $DFF196,$FF0
   Doke $DFF198,0
   Doke $DFF19A,$888
   Doke $DFF19C,$888
   Doke $DFF19E,$FFF
   Doke $DFF106,$220 : Rem BPLCON3
   Doke $DFF180,0
   Doke $DFF182,0
   Doke $DFF184,0
   Doke $DFF188,0
   Doke $DFF18A,0
   Doke $DFF18C,0
   Doke $DFF190,0
   Doke $DFF192,0
   Doke $DFF194,0
   Doke $DFF198,0
   Doke $DFF19A,0
   Doke $DFF19C,0
   Doke $DFF106,$20 : Rem BPLCON3

   'Build a Copperlist that sets the bitplanes pointers.

   Cop Movel $E0,RASTERADDRESS
   Cop Movel $E4,RASTERADDRESS
   Cop Movel $E8,B3A
   Cop Movel $EC,B4A
   Cop Swap

End Proc[-1]
Procedure _LOAD_PICTURE_INTO_RASTER[FILEPATH$]
   '--------------------------------------------------------------------------
   'DESCRIPTION
   'Loads a raw 8-bit chunky picture into the raster, ensuring that its size
   'is correct.
   '
   'INPUT
   'FILEPATH$ = path of picture file
   '
   'OUTPUT
   '-1/0 = OK/error
   '--------------------------------------------------------------------------

   Trap Open In 1,FILEPATH$ : If Errtrap Then Pop Proc[0]
   L=Lof(1)
   Close(1)
   If L<>RASTERSIZE Then Pop Proc[0]
   Trap Bload FILEPATH$,RASTERADDRESS

End Proc[Errtrap=0]
Procedure _RANDOMIZE_RASTER
   '--------------------------------------------------------------------------
   'DESCRIPTION
   'Randomizes the raster by swapping 16 dots per frame, until a mouse button
   'is pressed.
   '--------------------------------------------------------------------------

   XM=RASTERWIDTH-1
   YM=RASTERHEIGHT-1
   Repeat
      C=16
      While C
         X0=Rnd(XM)
         Y0=Rnd(YM)
         X1=Rnd(XM)
         Y1=Rnd(YM)
         A0=Y0*RASTERWIDTH+X0+RASTERADDRESS
         A1=Y1*RASTERWIDTH+X1+RASTERADDRESS
         C0=Peek(A0)
         Poke A0,Peek(A1)
         Poke A1,C0
         Dec C
      Wend
      _WAIT_SCREEN_BOTTOM
   Until Mouse Click

End Proc
Procedure _RESTORE_AMOS_ENVIRONMENT
   '--------------------------------------------------------------------------
   'DESCRIPTION
   'Restores the AMOS environment.
   '--------------------------------------------------------------------------

   Copper On
   Update On
   Auto View On
   Amos Unlock
   Break On
   _TURN_DISPLAY_DMA_ON[$20]

End Proc
Procedure _TURN_DISPLAY_DMA_OFF
   '--------------------------------------------------------------------------
   'DESCRIPTION
   'Disables the bitplanes, Copper and sprites DMA.
   '--------------------------------------------------------------------------

   _WAIT_SCREEN_BOTTOM
   Doke $DFF096,$3A0 : Rem DMACON

End Proc
Procedure _TURN_DISPLAY_DMA_ON[SSPRITESFLAG]
   '--------------------------------------------------------------------------
   'DESCRIPTION
   'Enables the bitplanes and Copper DMA.
   '
   'INPUT
   'SSPRITESFLAG = $20/0 = turn / do not turn sprites on
   '
   'WARNINGS
   'The chipset must have been set up properly.
   '--------------------------------------------------------------------------

   _WAIT_SCREEN_BOTTOM
   Doke $DFF096,$8380 or SSPRITESFLAG : Rem DMACON

End Proc
Procedure _WAIT_SCREEN_BOTTOM
   '--------------------------------------------------------------------------
   'DESCRIPTION
   'Waits for the bottom of the screen.
   '--------------------------------------------------------------------------

   While Deek($DFF004) and $3 : Wend
   Repeat : Until(Leek($DFF004) and $3FF00)>$12C00

End Proc
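
The 64-bit alignment done in _ALLOCATE_BITPLANE can be checked outside AMOS. What follows is only a Python sketch of the same arithmetic, not part of the archive: the function names and the sample bank addresses are hypothetical.

```python
def align_up_8(address: int) -> int:
    """Round address up to the next multiple of 8 (a 64-bit boundary)."""
    return (address + 7) & ~7


def aligned_bitplane(bank_start: int, size: int):
    """Return (aligned start address, number of bytes to reserve).

    Reserving size + 8 bytes guarantees that `size` bytes are still
    available after skipping at most 7 bytes to reach the boundary.
    """
    reserved = size + 8
    start = align_up_8(bank_start)
    assert start + size <= bank_start + reserved
    return start, reserved


if __name__ == "__main__":
    # Hypothetical bank start addresses, chosen only to show the rounding.
    for bank_start in (0x20000, 0x20001, 0x20007):
        start, reserved = aligned_bitplane(bank_start, 160 * 256)
        print(hex(bank_start), "->", hex(start), "reserved:", reserved)
```

The extra 8 bytes in the Reserve call are what make the rounding safe: without them, an unaligned bank would leave the bitplane a few bytes short.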

Last edited by saimo; 29 November 2023 at 13:07. Reason: Updated source code.
Old 02 July 2023, 22:37   #22
A500
Registered User
 
Join Date: Jun 2017
Location: Finland
Posts: 362
This is cool!
Old 28 November 2023, 23:42   #23
saimo
Registered User
 
Join Date: Aug 2010
Location: Italy
Posts: 787
I have just released a little update, accompanied by the PED81C Voxel Engine (PVE), i.e. a new demo. If you can't be bothered to try it yourself, you can see it in this video - but beware: YouTube's video compression degraded the visual quality (especially the color saturation and brightness).

[Embedded YouTube video]

Details about PVE straight from the manual:
Code:
--------------------------------------------------------------------------------
OVERVIEW

PVE is an experiment to test the graphical quality and computational performance
of the PED81C system. It lets you move freely around a typical voxel landscape.


--------------------------------------------------------------------------------
GETTING STARTED

PVE requires:
 * Amiga computer
 * AGA chipset
 * 80 kB of CHIP RAM
 * 4 MB of FAST RAM
 * PAL SHRES support
 * digital joystick and keyboard
 * 2.1 MB of storage space

To install PVE, unpack the LhA archive to any directory of your choice.

To start PVE, open the program directory and double-click the program icon from
Workbench or execute the program from shell.

Shell arguments:
 CACHECOPYBACK=CC/S: make the 68040/68060 data cache work in copyback mode
 CACHESWITCHING=CS/S: switch off the 68030 data cache burst or the 68040/68060
                      data cache while rendering the voxel
 RUNBENCHMARK=RB/S: benchmark graphics rendering

If your monitor / graphics card / scan doubler do(es) not support SHRES, the
colors will look off or even not show at all. In such case, to hopefully fix the
colors a bit, try the staggered lines option.


--------------------------------------------------------------------------------
CONTROLS

PVE is controlled by joystick (in the game port) and keyboard.

JOYSTICK | KEYBOARD | SPLASH SCREEN               | VOXEL SCREEN
---------+----------+-----------------------------+----------------------------
[UP]     |          |                             | move forwards
[DOWN]   |          |                             | move backwards
[LEFT]   |          |                             | turn left
[RIGHT]  |          |                             | turn right
[FIRE1]  |          | go to voxel screen          | accelerate
         | [F1]     | turn staggered lines on/off | turn staggered lines on/off
         | [F2]     | turn fps indicator on/off   | turn fps indicator on/off
         | [ESCAPE] | quit to AmigaOS             | go to splash screen


--------------------------------------------------------------------------------
MISCELLANEOUS

* The staggered lines shift the odd lines by 1 SHRES pixel to the right. On
  systems which handle SHRES correctly, that will reduce the jailbars effect
  (but give the screen a kind of wavy look). On systems which handle SHRES as
  HIRES (for example: MNT's VA2000 graphics card and Irix Labs' ScanPlus AGA -
  contrary to how it was originally marketed - display only the even or odd
  columns of pixels, so only reds and blues or greens and grays show), that
  helps improve the colors a bit (giving the screen a kind of scanline
  effect). On other systems, the results are unpredictable, but the option is
  still worth a try.
* The number shown in the top-left corner of the voxel screen is the fps
  indicator, which reports the number of frames rendered in the last second.
* The map wraps around at its edges.


--------------------------------------------------------------------------------
BENCHMARK

The performance of graphics rendering can be measured by means of the command
line RUNBENCHMARK option. Measuring the performance makes it possible to find
the best settings for any given machine.

On 68030 machines, the best settings can be found by running PVE from shell as
follows:
 > PVE RUNBENCHMARK
 > PVE RUNBENCHMARK CACHESWITCHING

On 68040 and 68060 machines, the best settings can be found by running PVE from
shell as follows (between parentheses are the shortened forms):
 > PVE RUNBENCHMARK
 > PVE RUNBENCHMARK CACHECOPYBACK
 > PVE RUNBENCHMARK CACHESWITCHING
 > PVE RUNBENCHMARK CACHECOPYBACK CACHESWITCHING

The benchmark makes PVE render 256 frames while rotating the camera by 360°,
quit to AmigaOS and print the results to the standard output as follows:
 * number of frames rendered;
 * elapsed time in seconds;
 * number of frames rendered per second.

During the benchmark, nothing shows. The elapsed time depends on the power of
the machine. On very slow machines, it might take quite a while (e.g. on a
machine that renders at 4 fps, the duration will be 256/4 = 64 seconds).

This table shows the results of various benchmarks expressed in fps.

                            |           DATA CACHE MODE           |
------+---------------------+--------+--------+---------+---------+-----
AMIGA | EXPANSION BOARD     |      D |      C |       S |     C+S | NOTE
------+---------------------+--------+--------+---------+---------+-----
1200  | ?                   |  6.401 |      - |       - |       - | 1
1200  | Blizzard 1230 IV    | 21.129 |      - |  21.241 |       - | 2
1200  | Blizzard 1260       | 11.770 | 11.770 |  29.047 |         | 3
1200  | PiStorm32           | 78.120 | 78.768 | 127.936 |         | 4
1200  | TerribleFire TF1260 | 10.094 |  9.612 |  28.122 |         | 3
1200  | TerribleFire TF1260 | 17.610 | 16.835 |  48.448 |         | 5
3000+ | BFG9060             | 11.004 | 11.004 |  31.011 |         | 3
3000+ | BFG9060             | 19.114 | 19.114 |  53.422 |         | 5
4000  | Cyberstorm MK III   |        |        |         |         | 3
4000  | Warp Engine         | 12.120 | 12.120 |  36.861 |         | 6
4000T | CyberStorm PPC      | 17.611 | 17.611 |  34.641 |         | 7
CD³²  | The Beast 030       | 29.240 |      - |  29.260 |       - | 8

DATA CACHE MODE:
 D = Default (always on + writethrough)
 C = Copyback
 S = Switching

NOTE
 1. 68020 14.19 MHz, FAST RAM only
 2. 68030 50 MHz, RAM 60 ns
 3. 68060 50 MHz
 4. Raspberry Pi 3 A+
 5. 68060 100 MHz
 6. 68060 80 MHz
 7. 68060 60 MHz
 8. 68030 70 MHz, SRAM


--------------------------------------------------------------------------------
TECHNICAL NOTES

* Rendering is done by columns, from bottom to top and then left to right, but
  the data is written to FAST RAM raster sequentially (therefore, in practice,
  the raster is rotated clockwise by 90°).
* The graphics in the FAST RAM raster are rotated and copied to a PED81C raster
  in CHIP RAM while the bitplanes DMA fetch is inactive. The rotation executes
  partially/entirely (depending on the CPU) in parallel with the writes to CHIP
  RAM.
* Rendering and buffering are totally asynchronous, so that the CPU must
  never wait and can run at full speed all the time (unless it is so fast that
  it renders the frames faster than they are shown).
* The code applies a depth of 256 steps per column, so it evaluates 256*128 =
  32768 dots per frame (and then renders only those which are actually visible).
* The screen resolution is 1020x200 SHRES pixels, which correspond to 255x200
  LORES-sized dots and to 128x200 logical dots.
* The screen resolution can be changed by redefining the width and height
  constants in the code and reassembling it.
* The program supports only maps of 1024x1024 pixels, but it can be made to
  support maps of other sizes by redefining the width and height constants in
  the code and reassembling it.
* The code is 100% assembly.
* The code is mostly optimized for 68030.
* The handling of the user input and of the camera is decoupled from the
  graphics rendering and executes every frame.
* The height of the camera adapts automatically to that of the dot in the map it
  is at, but it can be made user-controllable and its maximum value can be
  increased almost to the point that the landscape disappears at the bottom of
  the screen.
* The map color and height data are stored in separate files, but at load time
  they are merged in a single buffer consisting of <color, height> couples.
* The map requires 2 MB of FAST RAM.
* The program takes over the system entirely and returns to AmigaOS cleanly.


--------------------------------------------------------------------------------
BACKSTORY

After a hiatus from programming of several months (due to a computer-unrelated
project), I decided to finally create something for PED81C because I had made
nothing with it other than a few little examples, I wanted to test its graphical
quality and computational performance, and... I felt like having some good fun.
After some inconclusive mental wandering, the idea of making a voxel engine came
to mind for unknown reasons (I had never dabbled with voxel before).
When the engine was mature enough I decided to distribute PVE publicly (which
initially was not planned).
About the update, I fixed some palette values in a table in the documentation, added the formulas for calculating DIWSTRT, DIWSTOP, DIWHIGH, DDFSTRT and DDFSTOP to the documentation and implemented them in the AMOS Professional source code example. This is the snippet relative to the register settings:
Code:
In general, given a raster which is RASTERWIDTH dots wide and RASTERHEIGHT dots
tall, the values to write to the chipset registers in order to create a centered
screen can be calculated as follows:
 * SCREENWIDTH  = RASTERWIDTH * 8
 * SCREENHEIGHT = RASTERHEIGHT
 * DIWSTRTX     = $81 + (160 - SCREENWIDTH / 8)
 * DIWSTRTY     = $2c + (128 - SCREENHEIGHT / 2)
 * DIWSTRT      = ((DIWSTRTY & $ff) << 8) | ((DIWSTRTX + 1) & $ff)
 * DIWSTOPX     = DIWSTRTX + SCREENWIDTH / 4
 * DIWSTOPY     = DIWSTRTY + SCREENHEIGHT
 * DIWSTOP      = ((DIWSTOPY & $ff) << 8) | (DIWSTOPX & $ff)
 * DIWHIGH      = ((DIWSTOPX & $100) << 5) | (DIWSTOPY & $700) |
                  ((DIWSTRTX & $100) >> 3) | (DIWSTRTY >> 8)
 * DDFSTRT      = (DIWSTRTX - 17) / 2
 * DDFSTOP      = DDFSTRT + SCREENWIDTH / 8 - 8
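
As a quick cross-check of the formulas above, here is a minimal Python sketch (not part of the PED81C documentation). It assumes that every division is an integer division, as in the AMOS source; the function name ped81c_display_registers is made up for illustration.

```python
def ped81c_display_registers(raster_width: int, raster_height: int) -> dict:
    """Evaluate the centered-screen formulas for a given raster size."""
    screen_width = raster_width * 8      # width in SHRES pixels
    screen_height = raster_height
    diwstrt_x = 0x81 + (160 - screen_width // 8)
    diwstrt_y = 0x2C + (128 - screen_height // 2)
    diwstop_x = diwstrt_x + screen_width // 4
    diwstop_y = diwstrt_y + screen_height
    ddfstrt = (diwstrt_x - 17) // 2
    return {
        "DIWSTRT": ((diwstrt_y & 0xFF) << 8) | ((diwstrt_x + 1) & 0xFF),
        "DIWSTOP": ((diwstop_y & 0xFF) << 8) | (diwstop_x & 0xFF),
        "DIWHIGH": ((diwstop_x & 0x100) << 5) | (diwstop_y & 0x700)
                   | ((diwstrt_x & 0x100) >> 3) | (diwstrt_y >> 8),
        "DDFSTRT": ddfstrt,
        "DDFSTOP": ddfstrt + screen_width // 8 - 8,
    }


if __name__ == "__main__":
    # 160x256 is the raster size used by the AMOS example.
    for name, value in ped81c_display_registers(160, 256).items():
        print(f"{name} = ${value:04X}")
```

For the 160x256 raster of the AMOS example this gives DIWSTRT = $2C82, DIWSTOP = $2CC1, DIWHIGH = $2100, DDFSTRT = $38 and DDFSTOP = $D0.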

Last edited by saimo; 18 December 2023 at 23:32. Reason: Updated manual text.
Old 29 November 2023, 07:01   #24
TCD
HOL/FTP busy bee
 
Join Date: Sep 2006
Location: Germany
Age: 46
Posts: 31,613
I watched it yesterday evening and it's very impressive. What sort of setup/machine was the video recorded on?
Old 29 November 2023, 10:38   #25
saimo
Registered User
 
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by TCD View Post
What sort of setup/machine was the video recorded on?
It's simply a WinUAE recording: I don't have a way to capture the output of my A1200 :/
Old 29 November 2023, 10:39   #26
TCD
HOL/FTP busy bee
 
Join Date: Sep 2006
Location: Germany
Age: 46
Posts: 31,613
Okay, thank you for the info
Old 29 November 2023, 17:56   #27
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,107
Looks neat on a real monitor and very distinct. Only getting around 16-20fps on my B1260/50Mhz though, slow chip mem access speed really hurts :/
Old 29 November 2023, 18:28   #28
saimo
Registered User
 
Join Date: Aug 2010
Location: Italy
Posts: 787
@TCD

You're welcome!


@paraj

Quote:
Originally Posted by paraj View Post
Looks neat on a real monitor and very distinct.
Nice to hear! Thanks for the test and the report

Quote:
Only getting around 16-20fps on my B1260/50Mhz though, slow chip mem access speed really hurts :/
I'm baffled that its impact is such that the performance is worse than that of my Blizzard 1230 IV.
That said, the FAST RAM -> CHIP RAM copy loop is fine-tuned for my card. I made tens of tests and tried the weirdest solutions, and eventually found out that the best code was this:

Code:
          move.w  #RASTERSIZE/(13*4)-1,d7
.CopyDots move.l  (a6)+,d0
          move.l  (a6)+,d1
          move.l  (a6)+,d2
          move.l  (a6)+,d3
          move.l  (a6)+,d4
          move.l  (a6)+,d5
          move.l  (a6)+,d6
          movea.l (a6)+,a0
          movea.l (a6)+,a1
          movea.l (a6)+,a2
          movea.l (a6)+,a3
          movea.l (a6)+,a4
          movea.l (a6)+,a5
          movem.l d0-d6/a0-a5,(a7)
          adda.w  #13*4,a7
          dbf     d7,.CopyDots

          rept    (RASTERSIZE//(13*4))/4
          move.l  (a6)+,(a7)+
          endr
Maybe for your card (and 68060 cards in general?) it's best to replace movem.l with a sequence of move.l? Or use move16 or a mix of the various instructions?
Do you happen to know which is the best strategy?
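
To make the structure of the loop above easier to follow, here is a small Python sketch of how the raster gets split between the movem.l loop and the trailing move.l instructions. It is only a model of the arithmetic, under two assumptions: that `//` is the assembler's modulo operator, and that RASTERSIZE is 128*200 bytes (one byte per logical dot of the PVE screen). The function name is hypothetical.

```python
BLOCK_BYTES = 13 * 4  # 13 registers of 4 bytes each, written by one movem.l


def copy_loop_split(raster_size: int):
    """Return (movem.l iterations, trailing longword moves)."""
    iterations = raster_size // BLOCK_BYTES
    # Whatever does not fill a whole 52-byte block is copied longword by longword.
    tail_longwords = (raster_size % BLOCK_BYTES) // 4
    return iterations, tail_longwords


if __name__ == "__main__":
    raster_size = 128 * 200  # assumed size of the PVE raster in bytes
    iterations, tail = copy_loop_split(raster_size)
    print("movem.l iterations:", iterations)
    print("trailing move.l   :", tail)
    print("bytes covered     :", iterations * BLOCK_BYTES + tail * 4)
```

With the assumed size, the loop runs 492 times and 4 longwords are left for the rept block, so all 25600 bytes are covered.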
Old 29 November 2023, 18:39   #29
paraj
Registered User
 
paraj's Avatar
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,107
Quote:
Originally Posted by saimo View Post
Maybe for your card (and 68060 cards in general?) it's best to replace movem.l with a sequence of move.l? Or use move16 or a mix of the various instructions?
Do you happen to know which is the best strategy?

Chip writes are just slower than they need to be on my card (around 5.3M/s) regardless of what you do (don't think move16 works to chipram, but haven't tried). To maximize performance you want to do aligned long word writes to chipmem, then interleave computations that don't cause cache misses while the write(s) [up to 4] complete. There are ~30 cycles or so to do stuff per write. Obviously this isn't easy to do productively in a general case, so most of the time it's spent C2Ping "for free".
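
To put the figures above in perspective, here is a rough Python sketch that converts a CHIP RAM write bandwidth into CPU cycles per longword and into the time needed to copy one raster. It assumes that "5.3M/s" means 5.3 million bytes per second, a 50 MHz clock, and a 128*200-byte raster; the function names are hypothetical and the results are estimates, not measurements.

```python
def cycles_per_longword(clock_hz: float, bytes_per_second: float) -> float:
    """CPU cycles that elapse for every 4-byte write at the given bandwidth."""
    return clock_hz * 4 / bytes_per_second


def copy_time_ms(raster_bytes: int, bytes_per_second: float) -> float:
    """Milliseconds needed to push one raster through at the given bandwidth."""
    return raster_bytes / bytes_per_second * 1000


if __name__ == "__main__":
    clock_hz = 50e6          # 68060 at 50 MHz, as reported above
    bandwidth = 5.3e6        # assumed reading of "5.3M/s"
    raster_bytes = 128 * 200  # assumed raster size
    print(f"cycles per longword : {cycles_per_longword(clock_hz, bandwidth):.1f}")
    print(f"copy time per frame : {copy_time_ms(raster_bytes, bandwidth):.2f} ms")
```

Under these assumptions each longword write spans roughly 38 CPU cycles, which is in line with the "~30 cycles or so" available for other work per write.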
Old 30 November 2023, 00:47   #30
saimo
Registered User
 
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by paraj View Post
Chip writes are just slower than they need to be on my card (around 5.3M/s) regardless of what you do (don't think move16 works to chipram, but haven't tried). To maximize performance you want to do aligned long word writes to chipmem, then interleave computations that don't cause cache misses while the write(s) [up to 4] complete. There are ~30 cycles or so to do stuff per write. Obviously this isn't easy to do productively in a general case, so most of the time it's spent C2Ping "for free".
According to my 20+ years old notes, my board doesn't perform much better: about 26 cycles (at the same 50 MHz clock) after a write to CHIP RAM and 5.57 MB/s with movem.l d0-d6/a0-a6,(a7) (tomorrow I hope I'll manage to perform new tests). But that's with all DMA channels off, and in this case the CHIP bus is very busy with fetching bitplanes.
Could you try the attached test program, please? It's the same program, but with the FAST RAM -> CHIP RAM copy disabled, so that the stats printed out at the end will tell us the speed of rendering alone - and thus, indirectly, the impact of the writes to CHIP RAM. After clicking the left mouse button in the splash screen all you'll see is the screen flickering madly: after a few seconds, simply click the right mouse button to put an end to the headache-inducing show.

In the meanwhile, I received the results from tests made on other 68060 boards:
* A1200 + TF1260: 14.21 fps
* A4000 + Cyberstorm MK III: 18.80 fps

Last edited by saimo; 01 December 2023 at 02:23. Reason: Removed attachment, as I provided a new test version.
Old 30 November 2023, 02:18   #31
Karlos
Alien Bleed
 
Join Date: Aug 2022
Location: UK
Posts: 4,171
Mmm, delicious gory details.

I'm sure I saw another recent thread where it was noted that movem was performing less well than expected on 040/060, so might there be a correlation here?
Old 30 November 2023, 09:05   #32
saimo
Registered User
 
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by Karlos View Post
Mmm, delicious gory details.


Quote:
I'm sure I saw another recent thread where it was noted that movem was performing less well than expected on 040/060, so might there be a correlation here?
Could be.
Anyway, in the sleepless night that followed I realized that something better than fiddling with instructions can be done. The current (triple) buffering was devised for when (i.e. initially) rendering was done directly to CHIP RAM, but now that graphics are rendered in FAST RAM that isn't optimal anymore. I'm going to rework it so that the writes to CHIP RAM happen while the bitplanes are not being displayed (more precisely, I'll have the copy start right after the last line has been displayed) - that, hopefully, will improve performance.
Also, I thought that I can unroll the core loop of the renderer a bit and still have it fit the 68020 and 68030 cache, and save about 4.5 cycles per source dot - if it works out, that should give a 0.25-0.75 fps (rough estimate) increase on my 68030 machine.
Old 30 November 2023, 13:27   #33
Karlos
Alien Bleed
 
Join Date: Aug 2022
Location: UK
Posts: 4,171
I've basically posted this exact thing somewhere else, but why not have a routine targeted for each CPU that you can make a realistic specific optimisation for and then just detect which CPU is in use on startup and assign the relevant function address to a pointer somewhere? Sure, doing an indirect jump to the function is going to cost a few more cycles but it's presumably nothing compared to the work in the loop itself.
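
The idea can be sketched in a few lines. This is Python rather than 68k assembly, purely to show the shape of the dispatch: the routine names and the CPU strings are hypothetical, and the placeholder routines all do the same plain copy.

```python
from typing import Callable, Dict

Raster = bytearray
CopyRoutine = Callable[[Raster, Raster], None]


def copy_68030(src: Raster, dst: Raster) -> None:
    """Placeholder for a copy routine tuned for the 68030."""
    dst[:] = src


def copy_68060(src: Raster, dst: Raster) -> None:
    """Placeholder for a copy routine tuned for the 68060."""
    dst[:] = src


COPY_ROUTINES: Dict[str, CopyRoutine] = {
    "68030": copy_68030,
    "68060": copy_68060,
}


def select_copy_routine(cpu: str) -> CopyRoutine:
    """Pick the routine once at startup; fall back to the 68030 version."""
    return COPY_ROUTINES.get(cpu, copy_68030)


if __name__ == "__main__":
    copy = select_copy_routine("68060")  # the detected CPU would go here
    source, target = Raster(b"\x01\x02\x03\x04"), Raster(4)
    copy(source, target)                 # one indirect call per frame
    print(copy.__name__, bytes(target))
```

The selection happens once, so the per-frame cost is a single indirect call.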
Old 30 November 2023, 13:59   #34
saimo
Registered User
 
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by Karlos View Post
I've basically posted this exact thing somewhere else, but why not have a routine targeted for each CPU that you can make a realistic specific optimisation for and then just detect which CPU is in use on startup and assign the relevant function address to a pointer somewhere? Sure, doing an indirect jump to the function is going to cost a few more cycles but it's presumably nothing compared to the work in the loop itself.
Of course it's possible (and it doesn't even need to be a pointer: the jump address can be directly written in the code at startup), but it's hard and boring to write optimal code without being able to test it and having to rely on reports from others. Also, before going down that route (which wasn't planned, as I thought the hurdle would have been getting a decent speed on 68030, not the other way around - in fact, apart from the code written generally with the 68030 in mind, the only real-time optimization is specific for 68030*), I want to see if the bus access optimization I suggested above produces a significant advantage (as I hope). Finally, I must admit that this was just a little experiment, so I'd be happy to just reach more or less the same speed on all CPUs. Anyway, let's see.

By the way, the loop unrolling optimization worked as expected: it provided 1 extra fps for the rendering code (22.2 -> 23.2 frames rendered per second) and a 0.7 fps overall improvement (20.2 -> 20.9 fps). It's a pity that the outer loop misses fitting in the cache as well by only a few bytes (it's 266 bytes now).
The buffering strategy change, instead, is only on paper, and I'll be able to work on it only later - real life work got in the way :/

*The data cache burst is turned on and off as needed.

Last edited by saimo; 30 November 2023 at 18:18.
Old 30 November 2023, 17:50   #35
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,107
Quote:
Originally Posted by saimo View Post
Could you try the attached test program, please? It's the same program, but with the FAST RAM -> CHIP RAM copy disabled, so that the stats printed out at the end will tell us the speed of rendering alone - and thus, indirectly, the impact of the writes to CHIP RAM. After clicking the left mouse button in the splash screen all you'll see is the screen flickering madly: after a few seconds, simply click the right mouse button to put an end to the headache-inducing show.


It's only a little bit faster. 208/554 not touching controls with PVE-B, 208/498 moving a bit vs. 70/205 and 161/433 with copy active.
Old 30 November 2023, 18:17   #36
saimo
Registered User
 
saimo's Avatar
 
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by paraj View Post
It's only a little bit faster. 208/554 not touching controls with PVE-B, 208/498 moving a bit vs. 70/205 and 161/433 with copy active.
Thanks

Astonishing results!

Without copy:
208/554 > 18.77 fps
208/498 > 20.88 fps

With copy:
70/205 > 17.07 fps
161/433 > 18.59 fps

It looks like the copy costs about 1.7 to 2.3 fps, which is quite similar to what I get on my machine - and this aligns well with what we said above regarding CHIP RAM access.
In other words, it's the renderer code that is slower! Now, this is weird: that it wasn't optimal was a given (I only avoided some register conflicts, without taking into account pOEP and sOEP), but that it would perform worse is a big surprise. And there isn't much room for improvement, as the core code is only a bunch of basic instructions for a total of 46 bytes. The new minimally-unrolled loop version should help a bit. Maybe I'll give it a thought later - now I'll deal with the CHIP RAM writes / buffering.
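
For anyone wanting to reproduce the fps figures above: they match the reading that each pair is "frames rendered / display frames elapsed" on a 50 Hz PAL screen. That interpretation is an assumption (it is not stated in the stats themselves), but it reproduces all four values. A minimal Python sketch, with a hypothetical function name:

```python
PAL_FRAME_RATE = 50  # assumed display rate in Hz


def rendered_fps(rendered: int, displayed: int, rate: int = PAL_FRAME_RATE) -> float:
    """Frames rendered per second, given how many display frames went by."""
    return rendered * rate / displayed


if __name__ == "__main__":
    runs = {
        "without copy, idle":   (208, 554),
        "without copy, moving": (208, 498),
        "with copy, idle":      (70, 205),
        "with copy, moving":    (161, 433),
    }
    for label, (rendered, displayed) in runs.items():
        print(f"{label:22s} {rendered_fps(rendered, displayed):.2f} fps")
```

The differences between the matching runs (18.77 vs. 17.07 and 20.88 vs. 18.59) are where the "about 1.7 to 2.3 fps" cost of the copy comes from.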
Old 30 November 2023, 18:36   #37
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
From old tests, a good c2p routine is faster than a plain copy from fast to chip on Cyberstorm 060 boards.
Old 01 December 2023, 02:22   #38
saimo
Registered User
 
saimo's Avatar
 
Join Date: Aug 2010
Location: Italy
Posts: 787
New version:
* reworked buffering system, so that the FAST RAM -> CHIP RAM copy happens when there is no bitplanes fetch;
* optimized renderer core loop;
* replaced calls to exec's CacheControl() with custom code.

The new buffering strategy didn't bring the improvement I hoped for - but, still, it's an improvement.
The last change is due to the fact that, on the emulated 68060 system I've set up, CacheControl() didn't enable the branch cache and the store buffer. I don't know if it's because my installation of AmigaOS 3.9 (the same I use for the 68030, with just 680x0.library and 68060.library in place) was not sane or the function doesn't fully support the 68060 cache at all. If this was actually a problem, the performance should be better now.

Can you guys give the attached executable a shot and let me know how it runs on your 68060 machines and if it finally performs better than on my 68030, please?

This is how it looks on my machine:

[Embedded YouTube video]

Notes:
* the red bar indicates the time spent copying the rendered graphics to the video buffer in CHIP RAM;
* the scanline-ish look is due to the fact that my machine's video output goes through the ScanPlus AGA scandoubler, which does not support the SHRES resolution (it displays it as HIRES, skipping the even columns of pixels), so the program adopts a workaround to get decent (kind of) colors - the workaround consists in shifting every other line one SHRES pixel to the right, so that the even lines show greens and grays and the odd lines show reds and blues;
* as you can see, now it runs at 21 fps most of the time.

Last edited by saimo; 02 December 2023 at 23:03. Reason: Removed attachment, as I provided a newer version later.
Old 01 December 2023, 08:13   #39
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,107
About the same. Stationary: 17.4fps, Moving about: 19.3.

Last edited by paraj; 01 December 2023 at 08:33.
Old 01 December 2023, 08:23   #40
Reynolds
Alien Breeder
 
Join Date: Dec 2007
Location: Szigetszentmiklos / Hungary
Age: 46
Posts: 1,096
Awesome stuff. Looking forward to seeing where it ends up being used.
Keep up the great work!
 

