Waiting 300 microseconds on a shift/load change or button shift on a CD32 gamepad? - Page 2

ross · 14 August 2021, 15:06

Quote:

Originally Posted by Toni Wilen

Usual answer is: if it depends on CPU speed and CPU is 68020+: it won't be accurate.

This is also sort of undefined behavior because it can also depend on accelerator board hardware (not just CPU) how it handles CIA accesses.

The question here is why the stock A1200 with the base 020 and no fast ram accesses the CIA at half the speed of WinUAE.
And I suppose Saimo hasn't changed the CACR settings, so icache is on by default and the code is run from there.

I think this is an interesting case

ross · 14 August 2021, 15:40

Quote:

Originally Posted by saimo

OK, then I had made a wrong assumption: given that CIAs use the E-clock for timing and given that the Blizzard 1230 IV is known to connect to the machine efficiently, I had taken for granted that 2 cycles was the limit.
Unfortunately, I have only the aforementioned Blizzard card: do you have some other card(s) and would you feel like giving the test program I attached in the previous post a go?

Two E-clock cycles aren't the limit, see for example:
https://lallafa.de/blog/2015/09/amig...st-can-you-go/

Also, I think the WinUAE emulation of an A500 with a 68k@14MHz and fast RAM is accurate.

What I don't understand is the A1200...

But now I have a doubt .. in your tests what do you mean by "stock A1200"?
Is it still the same A1200 with Blizzard off or another basic A1200 without an accelerator card and fast memory?

saimo · 14 August 2021, 16:00

Quote:

Originally Posted by ross

Two E-clock cycles aren't the limit, see for example:
https://lallafa.de/blog/2015/09/amig...st-can-you-go/

That's a lengthy read! I don't know if/when I'll be able to go through it.
Skimming quickly, I see that it's mostly oriented to writing rather than reading. At the bottom, there's a little paragraph about reading that says that, basically, the speed should be the same. But I wonder if the context of proper data transfers (instead of just testing PRA repeatedly) affects timings.
By the way, this just prompted me to try my test again with DDRA set entirely to output, entirely to input and in a mixed way. I don't believe it should make any difference, but anyway...
The bad news is that earlier, when I turned on the machine to make the tests, the monitor refused to come to life: I temporarily hooked the Amiga up to the monitor I use for the C64, but that one (newer and better) doesn't support the 50 Hz refresh rate (silly device... it's also a TV, but when the VGA input is selected it refuses to support any mode which doesn't match the built-in ones). This forces me to make some tests by typing blindly on the keyboard and redirecting the output to file...

Quote:

But now I have a doubt .. in your tests what do you mean by "stock A1200"?
Is it still the same A1200 with Blizzard off or another basic A1200 without an accelerator card and fast memory?

Same A1200 with the card turned off. Why, does it make any difference?

Photon · 14 August 2021, 16:25

As I see it, after setting up the read, the correct state will be available to read a while later. We don't know the while, but we think we need much less than 300 usecs (~5 scanlines). Code uses multiple reads to cause a delay and then relies on the last reading. I don't see how that must be necessary. It should be possible to do something useful for the wait period required.

tst.b ;if necessary
;..useful code that takes at least n usecs (accelerator caveat)
tst.b ;read the state

I agree that the best time to read the inputs is during the VBI. A Copper Interrupt should suffice, as long as no higher priority interrupts are... taking priority.
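
To put the 300 usecs figure into perspective, here is a rough back-of-the-envelope sketch in Python. It assumes PAL timing (a 15625 Hz line rate and a 709379 Hz E-clock); the function names are made up for illustration only.

```python
# Rough PAL timing figures (assumed constants, not measured values).
PAL_HFREQ_HZ = 15625      # horizontal line frequency
PAL_ECLOCK_HZ = 709379    # CIA E-clock frequency

def usecs_to_scanlines(usecs):
    """Number of PAL scanlines that elapse in the given microseconds."""
    return usecs * PAL_HFREQ_HZ / 1_000_000

def usecs_to_eclocks(usecs):
    """Number of E-clock cycles that elapse in the given microseconds."""
    return usecs * PAL_ECLOCK_HZ / 1_000_000

print(usecs_to_scanlines(300))  # a little under 5 scanlines
print(usecs_to_eclocks(300))    # a bit more than 200 E-clock cycles
```

So a 300 usecs wait corresponds to a couple of hundred E-clock cycles, which is why burning it with back-to-back CIA reads is wasteful compared with doing useful work in between.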

ross · 14 August 2021, 16:37

Quote:

Originally Posted by saimo

Same A1200 with the card turned off. Why, does it make any difference?

I have no idea, I don't know what exactly the Blizzard does when it is 'deactivated'. I would like to see the results of a truly bare machine.

If the results are confirmed, the A1200 emulation could be (slightly) improved.

ross · 14 August 2021, 16:50

Quote:

Originally Posted by saimo

.. I see that it's mostly oriented to writing rather than reading.

No need to bother with DDR. Just use this version, which writes to the unused $B register.

saimo · 14 August 2021, 18:30

Quote:

Originally Posted by ross

I have no idea, I don't know what exactly the Blizzard does when it is 'deactivated'. I would like to see the results of a truly bare machine.

Ah, I thought you knew something I didn't.
Well, from experience, the board is dead when disabled. Only a machine reset can revive it.

I made a number of tests. I'm preparing the results...

saimo · 14 August 2021, 18:56

I should have been doing other stuff, but I got sucked into this...

I decided to write a number of tests to check various read/write combinations. The tests are based on these sequences of instructions (which are used also as labels, without "dbf"):

Code:

 * clr             dbf...
 * st              dbf...
 * tst             dbf...
 * clr clr         dbf...
 * st  st          dbf...
 * tst tst         dbf...
 * clr tst         dbf...
 * st  tst         dbf...
 * tst clr         dbf...
 * tst st          dbf...
 * clr clr tst tst dbf...
 * st  st  tst tst dbf...
 * tst tst clr clr dbf...
 * tst tst st  st  dbf...
 * clr tst st  tst dbf...
 * st  tst clr tst dbf...
 * tst clr tst st  dbf...
 * tst st  tst clr dbf...

tst is executed on PRA, clr and st are executed on DDRA (to test also any potential effects of changing the direction of PRA).

The core loop executes 709379 times and looks like this (the actual instructions in the loop reflect the combinations posted above):

Code:

   lea.l  $bfe001,a0 ;not included in the time measurement
   lea.l  $bfe201,a1 ;not included in the time measurement
   move.l #709378,d0 ;not included in the time measurement

.l tst.b  (a0)
   st.b   (a1)
   tst.b  (a0)
   clr.b  (a1)
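   dbf    d0,.l      ;loop on the low word of d0
   clr.w  d0         ;low word is $ffff here: reset it to 0...
   subq.l #1,d0      ;...and borrow 1 from the high word
   bpl.l  .l         ;repeat until the 32-bit counter goes negative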

This is what can be seen in the attached logs, produced again on the aforementioned A1200:
* the only case where access takes 1 E Clock cycle is clr, and only on the stock A1200 (on the Blizzard it takes 2 cycles, instead);
* all other accesses take 2 cycles;
* in some cases it might seem that combinations of reads and writes give an average access of 1.5 cycles, but I'm pretty sure that's just the effect of st, which is slower than clr and tst, and overlaps partially with the following instruction (including dbf);
* in some cases the 68030 seems to overlap the instructions slightly less efficiently than the 68020 (no, I didn't swap the CPUs around) - see the cases where the loop timing is about 3.1, 6.2 and 7.1 cycles.
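
The loop count is presumably not arbitrary: 709379 is the PAL E-clock frequency in Hz, so the total run time in seconds can be read directly as E-clock cycles per loop iteration. A small Python sketch of that conversion follows; the function name and the example timings are illustrative assumptions, not values taken from the attached logs.

```python
PAL_ECLOCK_HZ = 709379   # PAL E-clock frequency (assumed constant)
ITERATIONS = 709379      # loop count used by the tests

def eclocks_per_iteration(elapsed_seconds, iterations=ITERATIONS):
    """Average number of E-clock cycles spent in one pass of the loop."""
    return elapsed_seconds * PAL_ECLOCK_HZ / iterations

# Because the loop count equals the E-clock frequency, seconds map 1:1 to cycles.
print(eclocks_per_iteration(2.0))   # 2.0
print(eclocks_per_iteration(3.1))
```

With this choice of loop count, a test that runs for about 2 seconds spent about 2 E-clock cycles per iteration, and so on.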

Maybe later, if I can, I'll make a test where st is replaced with move.b dx,(ax).

Attached is the whole set of test programs and a script that will execute them all, producing a log. It would be very interesting to see the results of other machines.

EDIT: I have now also made the following tests with move.b d0,(a1):

Code:

 * move      dbf...
 * move_move dbf...
 * move_tst  dbf...
 * tst_move  dbf...

I couldn't be bothered to make the whole set of tests, because these are sufficient to see that, as expected, move performs like clr (although it's a trifle faster in the move_tst and tst_move cases on 68030).

The archive attached here now contains also these new tests. And, by the way, I forgot to mention that they are for 68020+ (I made them on top of a test-bed program I use for everything else, and it's written for 68020 or better CPUs only).

(Not so) funny side note: in the meantime, I opened the monitor I use for the A1200 to see why it wouldn't turn on anymore; I expected to find some leaking capacitors - and indeed several capacitors belonging to the internal power supply block were bulging - but unfortunately another part of the same block suffered much worse damage.

ross · 14 August 2021, 22:50

Quote:

Originally Posted by saimo

This is what can be seen in the attached logs, produced again on the aforementioned A1200:
* the only case where access takes 1 E Clock cycle is clr, and only on the stock A1200 (on the Blizzard it takes 2 cycles, instead);
* all other accesses take 2 cycles;
* in some cases it might seem that combinations of reads and writes give an average access of 1.5 cycles, but I'm pretty sure that's just the effect of st, which is slower than clr and tst, and allows the parallel execution of the following instruction (including dbf);
* in some cases the 68030 seems to overlap the instructions slightly less efficiently than the 68020 (no, I didn't swap the CPUs around) - see the cases where the loop timing is about 3.1, 6.2 and 7.1 cycles.

Interesting

So I suppose my cia-speed_b.68k returns 1 E-cycle as a result.

The results are somewhat similar to what happens with writing and reading in chip-ram, but greatly amplified due to the granularity of the clock.

There is a GAYLE document (it also deals with the synchronization of the processor to the CIA accesses as well as the generation of the E-clock) that could perhaps explain this time difference between read and write.
If all times were confirmed in other real machines then it could be useful in emulation.

saimo · 14 August 2021, 23:06

Quote:

Originally Posted by ross

Interesting

So I suppose my cia-speed_b.68k returns 1 E-cycle as a result.

Nope, it returns 2 cycles on 68030 and 4 cycles on 68020.
EDIT: sorry, I hadn't noticed the second test program! I'll try it now and report back.
EDIT2: here are the results:

Code:

S) Elapsed: 1409 ms, data: 1000000 bytes, speed: 709,72 KB/s
B) Elapsed: 5614 ms, data: 1000000 bytes, speed: 178,12 KB/s

S) stock PAL A1200
B) same A1200, but with Blizzard 1230 IV (68030 at 50 MHz and 60 ns RAM) on

So, 1 cycle on the stock A1200, but 4 cycles on the Blizzard.
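
For reference, this is roughly how the reported figures translate into E-clock cycles, sketched in Python. It assumes a PAL E-clock of 709379 Hz and one CIA access per byte of "data" (which is how I read the tool's output), and it takes the reported times at face value; the function name is hypothetical.

```python
PAL_ECLOCK_HZ = 709379   # PAL E-clock frequency (assumed constant)

def eclocks_per_access(elapsed_ms, accesses):
    """Average E-clock cycles per CIA access, from a reported elapsed time."""
    return elapsed_ms / 1000 * PAL_ECLOCK_HZ / accesses

stock = eclocks_per_access(1409, 1_000_000)      # S) line above
blizzard = eclocks_per_access(5614, 1_000_000)   # B) line above

print(round(stock, 2), round(blizzard, 2))
```

The two values land very close to 1 and 4 respectively, which is where the "1 cycle" and "4 cycles" readings come from.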

BTW, by coincidence you posted just when I edited my previous post: if you haven't noticed that already, give it another look to see the results with the move instruction.

ross · 14 August 2021, 23:42

Quote:

Originally Posted by saimo

EDIT2: here are the results:

Code:

S) Elapsed: 1409 ms, data: 1000000 bytes, speed: 709,72 KB/s
B) Elapsed: 5614 ms, data: 1000000 bytes, speed: 178,12 KB/s

S) stock PAL A1200
B) same A1200, but with Blizzard 1230 IV (68030 at 50 MHz and 60 ns RAM) on

So, 1 cycle on the stock A1200, but 4 cycles on the Blizzard.

Gah.. the result on the Blizzard is weird! Someone has to explain it to me. Maybe that register is uglier than the others

Quote:

BTW, by coincidence you posted just when I edited my previous post: if you haven't noticed that already, give it another look to see the results with the move instruction.

Downloaded

EDIT:
I play hard too, attached NODMA test (but I don't really believe in it...)

saimo · 15 August 2021, 00:06

Quote:

Originally Posted by ross

The results are somewhat similar to what happens with writing and reading in chip-ram, but greatly amplified due to the granularity of the clock.

I decided to make a test based on this, knowing already that, after a write to CHIP RAM, the 68030 on the Blizzard enjoys something between 26 and 27 free CPU cycles for cached, non-memory instructions.
Now, given that the E clock cycle is 5 times slower than a color clock cycle, I'd expect the 68030 to have something between 26*5 = 130 and 27*5 = 135 free CPU cycles after a write to a CIA.
This is an example of how much dummy code can be added after a write to DDRA without affecting at all the overall execution time:

Code:

.l move.b  d0,(a1)
   moveq.l #14,d1 ;2 cycles
   moveq.l #14,d1 ;2 cycles
   moveq.l #14,d1 ;2 cycles
   moveq.l #14,d1 ;2 cycles
.d add.l   d2,d2  ;2 cycles
   dbf     d1,.d  ;6 cycles (except for the last time)
   dbf     d0,.l
   clr.w   d0
   subq.l  #1,d0
   bpl.b   .l

(The timings are relative to the CPU and cached instructions.)

For simplicity, let's ignore the initial non-cache case.
There is no instruction overlap because none of the instructions has a tail.
The moveqs take 4*2 = 8 cycles.
The inner loop takes 2+6 = 8 cycles. It executes 15 times, so it takes 8*15 = 120 cycles.
The total is thus 8+120 = 128 cycles.
Actually, though, the outer dbf also has to be taken into account, as that executes in parallel as well: hence, the total is 134 cycles. Makes perfect sense.
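
The same bookkeeping, written out as a small Python sketch. The per-instruction cycle counts are simply the ones quoted in the comments of the loop above, and the not-taken final dbf pass is ignored, as in the text; the variable names are made up for illustration.

```python
MOVEQ_CYCLES = 2
ADD_CYCLES = 2
DBF_CYCLES = 6   # taken branch; the final, not-taken pass is ignored here

moveqs = 4 * MOVEQ_CYCLES                            # four moveq.l #14,d1
inner_loop = (14 + 1) * (ADD_CYCLES + DBF_CYCLES)    # d1 = 14 -> 15 passes
outer_dbf = DBF_CYCLES                               # dbf d0,.l
total = moveqs + inner_loop + outer_dbf

# Expected window: 26..27 free cycles after a CHIP RAM write, scaled by 5.
budget = range(26 * 5, 27 * 5 + 1)

print(moveqs, inner_loop, total, total in budget)   # 8 120 134 True
```

The total falls inside the expected 130..135 window, which matches the observation that the dummy code does not lengthen the loop.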

EDIT: I forgot the following (I was too exhausted).

The loop executes 709379 times and takes 2 seconds (both with and without the dummy code).
This tells us that (cycle = E clock cycle):
* each loop takes 2 cycles (confirming the previous tests);
* the CPU manages to commit the write in 1 cycle (otherwise it wouldn't be able to execute the dummy code for 1 additional cycle);
* two consecutive writes can't happen in less than 2 cycles (as proven by the other tests, even if made with two consecutive instructions - e.g. move move).

From this, one might think:
* that the extra cycle is needed to complete the bus protocol (until that is completed, the CPU bus controller is stalled, so the CPU cannot execute another access to memory) or
* that the CIA needs the extra cycle to be able to accept the second write.
However, the tests on the stock A1200 disprove both hypotheses: the 68020 does manage to execute clr and move (but not scc) entirely in 1 cycle (and also execute something else in between consecutive writes, like dbf)!
So, the oddity must lie in the expansion bus.

That said, please keep in mind that reads, instead, always take 2 cycles (also on the stock A1200, I mean) - again, on my machine, at least.

saimo · 15 August 2021, 10:33

Quote:

Originally Posted by ross

Gah.. the result on the Blizzard is weird! Someone has to explain it to me. Maybe that register is uglier than the others

EDIT:
I play hard too, attached NODMA test (but I don't really believe in it...)

Same result. Then, I changed $bfeb01 to $bfe201, just to verify whether the unused register has some weird side effect, but the result was still the same.
Then, it dawned on me: your program reports that the test lasted about 5.6 seconds, but it actually lasted about half as long (I just mentally counted the seconds)! Given that the A1200 monitor is no more, I did the test after a full system boot, which opens a Euro72 screen (so that I get to see it also on a dumb VGA monitor) and I guessed that perhaps you use the screen refresh frequency to measure the time. So, I rebooted without startup-sequence and ran all three tests blindly, redirecting the output to file; in all cases, the result was:

Code:

Elapsed: 2819 ms, data: 1000000 bytes, speed: 354,73 KB/s

P.S. I won't be able to make further tests today.

ross · 15 August 2021, 11:46

Quote:

Originally Posted by saimo

Given that the A1200 monitor is no more, I did the test after a full system boot, which opens a Euro72 screen (so that I get to see it also on a dumb VGA monitor) and I guessed that perhaps you use the screen refresh frequency to measure the time.

Finally it makes sense

Thanks.

To have a fairly accurate time and not touch the standard timers I used the TOD-B which is locked to the horizontal frequency.
Then all I do is make a difference and calculate the time with:

divu.w	#15625*256/1000,d0	; d0=ms (pal_hfreq*scale_down/granularity)

Of course it only works in PAL mode
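
A minimal Python sketch of that conversion, and of why a non-PAL screen mode inflates the result. It assumes that d0 holds the TOD tick difference multiplied by 256 before the divide (my reading of "scale_down"; the actual source is not shown here), and the 31250 Hz line rate is just a hypothetical "twice PAL" value, not the real rate of any particular mode.

```python
PAL_HFREQ_HZ = 15625                      # TOD-B tick rate on a PAL screen
SCALE = 256                               # assumed pre-scaling of the tick count
DIVISOR = PAL_HFREQ_HZ * SCALE // 1000    # 4000, the divu.w operand

def ticks_to_ms(scaled_ticks):
    """Mimic the divu.w: integer quotient of the scaled ticks by the constant."""
    return scaled_ticks // DIVISOR

def reported_ms(true_ms, actual_hfreq_hz):
    """What the tool would print if the TOD ticked at actual_hfreq_hz."""
    ticks = true_ms * actual_hfreq_hz // 1000
    return ticks_to_ms(ticks * SCALE)

print(reported_ms(2819, PAL_HFREQ_HZ))    # close to the true 2819 ms
print(reported_ms(2819, 31250))           # roughly twice as much
```

Since the divisor is fixed for 15625 Hz, any mode with a faster line rate makes the reported time grow by the same factor, which fits the 5614 ms versus 2819 ms readings above.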

Therefore excellent, the $B register can be used to generate precise minimum delays of 1 E-Clock even on A1200, without worries, to be used as a sort of $1FE of the custom ones.

saimo · 15 August 2021, 13:43

Quote:

Therefore excellent, the $B register can be used to generate precise minimum delays of 1 E-Clock even on A1200, without worries, to be used as a sort of $1FE of the custom ones.

Unfortunately no, because the delay is 1 clock only on the unexpanded A1200. The only generally valid conclusion revealed by the tests done so far is that reading always takes 2 cycles (probably regardless of the register). Again, on my machine, at least.

ross · 15 August 2021, 14:15

Quote:

Originally Posted by saimo

Unfortunately no, because the delay is 1 clock only on the unexpanded A1200. The only generally valid conclusion revealed by the tests done so far is that reading always takes 2 cycles (probably regardless of the register). Again, on my machine, at least.

"to generate precise minimum delays of 1 E-Clock"
I only need a minimum

modrobert · 15 August 2021, 14:25

There seems to be enough margin to miss a pulse or two when using VBLANK interrupt as the CD32 blue button (first in the pulse train) duration is 3 cycles through the shift register (the other buttons are 2).

http://gerdkautzmann.de/cd32gamepad/cd32gamepad.html

Out of curiosity, are there any Amiga interrupts capable of triggering on pin 9 in the joystick port(s)? I assume that would be ideal, to get an interrupt when the CD32 button pulses are coming, deal with that, and then return.

ross · 15 August 2021, 15:11

Quote:

Originally Posted by modrobert

There seems to be enough margin to miss a pulse or two when using VBLANK interrupt as the CD32 blue button (first in the pulse train) duration is 3 cycles through the shift register (the other buttons are 2).

Why do you think you might miss a pulse?
The pulse train is initiated and managed by the Amiga (and could be at any time during VBI).

Quote:

Out of curiosity, are there any Amiga interrupts capable of triggering on pin 9 in the joystick port(s)? I assume that would be ideal, to get an interrupt when the CD32 button pulses are coming, deal with that, and then return.

Nope.

saimo · 15 August 2021, 15:11

Quote:

Originally Posted by ross

"to generate precise minimum delays of 1 E-Clock"
I only need a minimum

It must be a language thing. So, you mean: "minimum" = "at least". I had interpreted that as "as small as".

modrobert · 15 August 2021, 15:22

Quote:

Originally Posted by ross

Why do you think you might miss a pulse?
The pulse train is initiated and managed by the Amiga (and could be at any time during VBI).

Yes, my mistake, starting to remember now, so the Amiga activates pin 5 to get the data from CD32 controller shifted out on pin 9, right?
