Trackmo tech

paraj · 19 March 2017, 20:54

In my quest to be able to compete with the brightest minds of '87, I figured being able to create a trackmo was on the required list.

Before you grab your pitchforks let me make it clear that:
1) This is purely a learning exercise; I derive enjoyment from discovering how things work
2) If I ever create something worthwhile, an HD-compatible version will be available alongside
3) I'll probably use somebody else's track loading routines once I understand how they are supposed to work

I've gotten pretty far on my own with the usual resources (HRM, this forum and Photon's site) and am at a point where I can take over the system from a boot block and load from the disk.
The loaded binary is just a simple self-checking program with lots of repeated data (~384k) that I used to debug the track loader.
It seems to work fine in WinUAE in all the various configurations I bothered checking (except kick13+OCS+8MB chip, but that seems to be causing me trouble with other programs as well, so I haven't looked into that yet).

With that out of the way, my questions!

Killing the system.

Am I doing anything wrong or suboptimal in the following steps? It'd be nice if this worked on all Amigas, but my target is a classic A500 (with support for common expansions) or thereabouts:
1) Kill any caches with CacheControl(0, -1) if exec.library v37 or later
2) Scan SysBase->MemList for the largest MEMF_FAST block (if available) - here it seems I need to round mh_Lower/mh_Upper in some way to get the full memory?
3) Use SysBase->MaxLocMem as end of chip mem
4) Now I need to determine where we can relocate ourselves to and find a safe place to put the stacks (user/supervisor). This is a bit iffy since we could be loaded anywhere - what's the most compatible strategy here (that's easy to implement)? I currently go for the top of fast ram (if available) otherwise top of chip.
5) In a Supervisor() call I take over the system by disabling interrupts/DMA, moving VBR to 0 (if 68010+) and switching stacks.
6) I now copy the rest of the boot loader to the new location and jump there.
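
The selection logic in steps 2 to 4 can be modelled in a few lines. This is only a host-side Python sketch of the decision, not Amiga code: `MemRegion` and `pick_relocation_top` are made-up names, the tuples stand in for the mh_Attributes/mh_Lower/mh_Upper fields of each MemList entry, and the rounding question from step 2 is deliberately left out.

```python
from typing import NamedTuple, Sequence

MEMF_CHIP = 1 << 1  # exec memory attribute flags
MEMF_FAST = 1 << 2


class MemRegion(NamedTuple):
    """Hypothetical stand-in for one MemHeader in SysBase->MemList."""
    attributes: int
    lower: int
    upper: int


def pick_relocation_top(regions: Sequence[MemRegion], max_loc_mem: int) -> int:
    """Return the address to relocate below: the top of the largest
    fast block if one exists, otherwise the end of chip memory."""
    fast = [r for r in regions if r.attributes & MEMF_FAST]
    if fast:
        largest = max(fast, key=lambda r: r.upper - r.lower)
        return largest.upper
    return max_loc_mem


# Example values are invented for illustration only.
chip_only = [MemRegion(MEMF_CHIP, 0x000400, 0x080000)]
expanded = chip_only + [MemRegion(MEMF_FAST, 0xC00000, 0xC80000)]
print(hex(pick_relocation_top(chip_only, 0x080000)))
print(hex(pick_relocation_top(expanded, 0x080000)))
```

Stacks and the relocated loader would then be carved downwards from the returned address, which still leaves open the case where the boot block itself was loaded near that address.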

Track loading.

I don't expect anybody else to subject their poor Amigas to my track loading code (and unfortunately I don't have a real one to test on at the moment), so I'm restricted in how well I can test compatibility. I'd appreciate some input on how to avoid the usual pitfalls and improve the chances of actually running on real hardware.
In another thread Toni Wilen mentions that waiting a little while before virtually inserting the disk in WinUAE slightly alters the timing - more tips like this would be great.
For timing I take a cue from Photon's example source (http://coppershade.org/asmskool/SOUR...%23MFMloader.S) and use vhposr for timing - That should be OK?
Stepping in the same direction requires 3ms of waiting - easy enough, but HRM mentions DSKSTEP needing to be pulsed. Photon's source does two NOPs between toggling the bits - is this the preferred method? I read vhposr instead, figuring that should work even with an accelerator. Is this better or worse?
HRM states that 15ms is needed after a direction change - is this measured from the last step pulse? Writing this, I'm beginning to suspect my code may only work by accident (would a violation be detectable in WinUAE?).
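
For raster-based timing, the 3ms and 15ms delays have to be turned into a number of scanlines. A minimal sketch of that arithmetic, assuming a PAL machine where one raster line lasts roughly 64 microseconds; `lines_to_wait` and the one-line safety margin are illustrative choices, not taken from HRM or from Photon's source.

```python
PAL_LINE_US = 64  # assumption: one PAL raster line is roughly 64 microseconds

STEP_DELAY_US = 3_000      # delay after a step in the same direction
REVERSE_DELAY_US = 15_000  # delay after a direction change


def lines_to_wait(delay_us: int, margin: int = 1) -> int:
    """Number of raster-line changes to count so that at least
    delay_us microseconds have passed.

    The margin covers the fact that the wait starts somewhere in the
    middle of a line, so the first counted line may be partial.
    """
    full_lines = -(-delay_us // PAL_LINE_US)  # ceiling division
    return full_lines + margin


print(lines_to_wait(STEP_DELAY_US), lines_to_wait(REVERSE_DELAY_US))
```

On an NTSC machine the line time is slightly shorter, so the constant would need adjusting there.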

If I got the code for reading the TOD counter in CIAA right, I'm able to load 769 sectors in 843 vblanks (~17s), giving a transfer speed of about 23k/s.
I couldn't find much in the way of hard numbers for what kind of speeds to expect. Since I don't think I'll be going for overlapping DMA and MFM decoding in a boot block, no records will be broken, but it'd be nice to know whether this is acceptable compared to other loaders.
For a trackmo I envision having some disk loader code potentially running every vblank for simplicity rather than going for max disk throughput, so max speed is only of theoretical interest.
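
The transfer speed quoted above can be reproduced with a short calculation. This sketch assumes 512 data bytes per sector and a 50 Hz PAL vblank; `transfer_rate` is just an illustrative helper name.

```python
SECTOR_BYTES = 512  # data payload of one sector
VBLANK_HZ = 50      # assumption: PAL, 50 vertical blanks per second


def transfer_rate(sectors: int, vblanks: int, vblank_hz: int = VBLANK_HZ) -> float:
    """Average throughput in bytes per second."""
    seconds = vblanks / vblank_hz
    return sectors * SECTOR_BYTES / seconds


rate = transfer_rate(769, 843)
print(f"{843 / VBLANK_HZ:.2f} s, {rate:.0f} bytes/s, {rate / 1024:.1f} KiB/s")
```

With these inputs the load takes just under 17 seconds and the rate comes out at a little over 23,000 bytes per second, consistent with the figure above.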

Phew, that turned out a lot longer than I expected. Thanks to those who bothered reading even part of it

TL;DR: how2kill system but keep its brain/when do I need to wait 15ms for a floppy direction change/what are the normal loading speeds from floppy?

losso · 20 March 2017, 18:45

In case you haven't seen it, the sources of Planet Rocklobster include the trackmo framework and build tools. The track loading routines are based on Photon's work as well.

A lot of stuff is happening early in the bootblock, e.g. just querying and then allocating the largest chip/fast block, then setting up a custom memory management on top of that and using that for the stack allocation, the where-to-load-ourselves region etc.

The VBR is used as-is, not set to 0. CacheControl is invoked later, like you described. Supervisor() is only used when resetting.

paraj · 20 March 2017, 20:07

Quote:

Originally Posted by losso

In case you haven't seen it, the sources of Planet Rocklobster include the trackmo framework and build tools. The track loading routines are based on Photon's work as well.

A lot of stuff is happening early in the bootblock, e.g. just querying and then allocating the largest chip/fast block, then setting up a custom memory management on top of that and using that for the stack allocation, the where-to-load-ourselves region etc.

The VBR is used as-is, not set to 0. CacheControl is invoked later, like you described. Supervisor() is only used when resetting.

Hadn't seen that actually, good description of the effects too. It also highlights why I think I need to compete with 1987 instead of 2015

Thanks for the pointer!

The boot loader code looks familiar, I think I may have stumbled on it (or a variation) while searching for info on amiga boot loaders... It may have been the same link actually, but I was probably too focused on the task at hand to notice.

As far as I can tell their loader is pretty system friendly - allocating the correct way and not overwriting any memory used by the system, so they wouldn't need to re-locate the VBR. Instead it seems they rely on VBR always being zero to save a couple of instructions (they have commented out code to read it). Doesn't seem like a totally unreasonable assumption (and they probably checked most popular kickstart versions), so I'll keep that in mind when I need to shave off a few bytes

phx · 30 March 2017, 19:49

Quote:

Originally Posted by paraj

1) Kill any caches with CacheControl(0, -1) if exec.library v37 or later

I wouldn't necessarily turn the caches off. When you write all the code yourself then you should know it is clean and that it works with caches enabled.

Maybe determining the CPU type could be useful. This is usually the first thing I do, before reading VBR, touching the caches or initializing the MMU.

Be prepared for really strange 68060 boards. Last year I had a tester with an A1200/060 where Sqrxz4 hung in the boot-loader. It turned out that VBR was NOT zero (during boot!) and that the MMU already had a full configuration, which I had to disable and replace by a simple transparent translation setting (for Chip-RAM/Fast-RAM, ignoring Zorro space).
I have an A3000 and an A4000 with CSPPC/060 myself. But I have never seen that before!

Quote:

4) Now I need to determine where we can relocate ourselves to and find a safe place to put the stacks (user/supervisor). This is a bit iffy since we could be loaded anywhere - what's the most compatible strategy here (that's easy to implement)?

I might suggest vlink's "rawseg" output format, because I implemented it for easily relocating my games (Solid Gold, Sqrxz). It uses a linker script to define segments (e.g. a Chip-RAM and a Fast-RAM segment), and writes the output for each segment into a different file. Additionally, when the -q (keep relocations) option is given, the absolute addresses in the linker script are ignored and 32-bit relocation tables for each combination of segments are written into additional files. The format of each file is: number of relocations, followed by relocation offsets. For example:

Code:

vlink -brawseg -q -minalign 2 -Tgame.ld -o sqrxz4raw main.o interrupt.o game.o tiles.o map.o bob.o blit.o hero.o monster.o animation.o platform.o crate.o text.o scroll.o txtscroll.o statusbar.o logo.o font.o background.o display.o input.o music.o sound.o story.o menu.o ptplayer.o trackdisk.o hiscores.o memory.o debug.o end.o
Devel:games/Sqrxz4OCS> list sqrxz4raw#?
sqrxz4raw.fast.relfast             500 ----rwed Today     19:13:43
sqrxz4raw.fast.relchip            1408 ----rwed Today     19:13:43
sqrxz4raw.fast                  116796 ----rwed Today     19:13:43
sqrxz4raw.chip                  399932 ----rwed Today     19:13:43
sqrxz4raw                           63 ----rwed Today     19:13:43

*.fast is the binary output for the fast-segment and *.chip for the chip-segment. There are no relocations in Chip-RAM in this case, so we have only *.fast.relfast for the offsets where the base address of the fast-segment has to be added, and *.fast.relchip to add the chip base-address for all offsets therein. Using this linker script:

Code:

PHDRS {
    fast PT_LOAD;
    chip PT_LOAD;
}

SECTIONS {
    . = 0;
    .chip: {
        *(chipmem)
        . = ALIGN(2);
        _FREECHIP = .;
    } :chip

    . = 0x100000;
    .text: { *(CODE) } :fast
    .data: { *(DATA) }
    .sdata: {
        _LinkerDB = . + 0x7ffe;
        _SDA_BASE_ = . + 0x7ffe;
        *(.sdata __MERGED)
    }
    .bss: { *(BSS) }
}

May look more complicated than it is.
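
To illustrate how such a relocation file might be consumed, here is a host-side Python sketch that patches a segment image. The description above only says "number of relocations, followed by relocation offsets"; that the count and the offsets are stored as 32-bit big-endian values is an assumption here, and `apply_relocs` is a hypothetical helper, not part of vlink. A real loader would do the same thing in 68k code after loading each segment.

```python
import struct


def apply_relocs(segment: bytearray, table: bytes, base: int) -> int:
    """Add `base` to every 32-bit longword in `segment` listed in `table`.

    Assumed table layout: a 32-bit big-endian count followed by that
    many 32-bit big-endian offsets into the segment.
    Returns the number of relocations applied.
    """
    (count,) = struct.unpack_from(">I", table, 0)
    offsets = struct.unpack_from(f">{count}I", table, 4)
    for off in offsets:
        (value,) = struct.unpack_from(">I", segment, off)
        struct.pack_into(">I", segment, off, (value + base) & 0xFFFFFFFF)
    return count


# Tiny made-up segment: one pointer at offset 4 that refers to offset 0x10.
segment = bytearray(12)
struct.pack_into(">I", segment, 4, 0x10)
table = struct.pack(">II", 1, 4)  # one relocation, at offset 4

apply_relocs(segment, table, base=0x200000)
print(segment[4:8].hex())
```

For the files listed above, *.fast.relfast would be applied with the fast-segment base and *.fast.relchip with the chip-segment base, both patching the *.fast image.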

Quote:

For timing I take a cue from Photon's example source (http://coppershade.org/asmskool/SOUR...%23MFMloader.S) and use vhposr for timing - That should be OK?

Yes. I'm also doing my timing for trackdisk routines with VPOS.

Quote:

Stepping in the same direction requires 3ms of waiting - easy enough, but HRM mentions DSKSTEP needing to be pulsed, Photon's source does two NOPs between toggling the bits - is this the preferred method? I read vhposr figuring that should work even with an accelerator, is this better/worse?

No VPOS-based waiting should be needed. In my experience NOPs are absolutely sufficient. Reading/writing CIA registers automatically slows down the CPU to the Chip-RAM bus speed. I'm doing BCLR-NOP-BSET for such a step-pulse. In my tests it worked with all hardware up to the fastest 060 machines.

Quote:

I couldn't find much in the way of hard numbers for what kind of speeds to expect

I never really cared. Loading from disk is slow anyway, and optimizations don't make much difference. The optimal case would be to load a new track for every rotation and decode the last track in the meantime, while waiting for the Disk-DMA to finish.

I just start reading a track, wait for DMA to finish, then decode the blocks I need.
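
For a rough sense of the ceiling in the optimal case described above (one track per rotation), the arithmetic can be sketched as follows. It assumes a standard double-density drive at 300 rpm with 11 sectors of 512 bytes per track, and it ignores head stepping, settle time and waiting for sync, so real loaders will land below it. `throughput_ceiling` is an illustrative name.

```python
RPM = 300               # assumption: standard DD drive speed
SECTORS_PER_TRACK = 11  # standard Amiga DD track
SECTOR_BYTES = 512


def throughput_ceiling(rotations_per_track: float = 1.0) -> float:
    """Upper bound in bytes per second when every track costs the
    given number of disk rotations (1.0 = fully overlapped decode)."""
    rotations_per_second = RPM / 60
    track_bytes = SECTORS_PER_TRACK * SECTOR_BYTES
    return rotations_per_second * track_bytes / rotations_per_track


print(throughput_ceiling(1.0))  # read and decode fully overlapped
print(throughput_ceiling(2.0))  # e.g. one rotation lost per track
```

The second call models a read-then-decode loader that misses one full rotation per track, which is a pessimistic assumption when the decode is fast.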

paraj · 30 March 2017, 20:57

Quote:

Originally Posted by phx

I wouldn't necessarily turn the caches off. When you write all the code yourself then you should know it is clean and that it works with caches enabled.

While I don't have a complete use case thought out for the loader, I'm thinking it'd only be relevant for very low-end Amigas anyway. My first Amiga was an unexpanded KS1.2 A500 so that might be my target (but requiring 512K slow RAM might be needed to make something interesting). So I'm not really that concerned about getting better performance on much later systems, only making sure I maximize the chance it'll still work if somebody is interested enough to run my stuff on a real 'miga

(myself included once I get it up and running again...)

Quote:

Originally Posted by phx

Maybe determining the CPU type could be useful. This is usually the first thing I do, before reading VBR, touching the caches or initializing the MMU.

Be prepared for really strange 68060 boards. Last year I had a tester with an A1200/060 where Sqrxz4 hung in the boot-loader. It turned out that VBR was NOT zero (during boot!) and that the MMU already had a full configuration, which I had to disable and replace by a simple transparent translation setting (for Chip-RAM/Fast-RAM, ignoring Zorro space).
I have an A3000 and an A4000 with CSPPC/060 myself. But I have never seen that before!

Exactly the insight I was hoping for - the instruction words used to be compatible with systems where VBR<>0 on startup were not in vain!

Quote:

Originally Posted by phx

I might suggest vlink's "rawseg" output format, because I implemented it for easily relocating my games (Solid Gold, Sqrxz). It uses a linker script to define segments (e.g. a Chip-RAM and a Fast-RAM segment), and writes the output for each segment into a different file. Additionally, when the -q (keep relocations) option is given, the absolute addresses in the linker script are ignored and 32-bit relocation tables for each combination of segments are written into additional files. The format of each file is: number of relocations, followed by relocation offsets. For example:

Code:

vlink -brawseg -q -minalign 2 -Tgame.ld -o sqrxz4raw main.o interrupt.o game.o tiles.o map.o bob.o blit.o hero.o monster.o animation.o platform.o crate.o text.o scroll.o txtscroll.o statusbar.o logo.o font.o background.o display.o input.o music.o sound.o story.o menu.o ptplayer.o trackdisk.o hiscores.o memory.o debug.o end.o
Devel:games/Sqrxz4OCS> list sqrxz4raw#?
sqrxz4raw.fast.relfast             500 ----rwed Today     19:13:43
sqrxz4raw.fast.relchip            1408 ----rwed Today     19:13:43
sqrxz4raw.fast                  116796 ----rwed Today     19:13:43
sqrxz4raw.chip                  399932 ----rwed Today     19:13:43
sqrxz4raw                           63 ----rwed Today     19:13:43

*.fast is the binary output for the fast-segment and *.chip for the chip-segment. There are no relocations in Chip-RAM in this case, so we have only *.fast.relfast for the offsets where the base address of the fast-segment has to be added, and *.fast.relchip to add the chip base-address for all offsets therein. Using this linker script:

Code:

PHDRS {
    fast PT_LOAD;
    chip PT_LOAD;
}

SECTIONS {
    . = 0;
    .chip: {
        *(chipmem)
        . = ALIGN(2);
        _FREECHIP = .;
    } :chip

    . = 0x100000;
    .text: { *(CODE) } :fast
    .data: { *(DATA) }
    .sdata: {
        _LinkerDB = . + 0x7ffe;
        _SDA_BASE_ = . + 0x7ffe;
        *(.sdata __MERGED)
    }
    .bss: { *(BSS) }
}

May look more complicated than it is.

Looks like standard ld script syntax to me, the only frightening thing is having to look up the symbols in the manual

I've been playing around with various approaches on my own and ended up with exactly this approach, just using my own hacky hunk->files converter. Nice that vlink supports this directly! Guess I should have read the manual, but since this is a hobby project I prefer just playing around with some code

EDIT: My concern about where to put the stacks is that I'm starting to use absolute memory addresses before relocating - I could potentially be overwriting where we're running (even if the probability is remote)

Quote:

Originally Posted by phx

Yes. I'm also doing my timing for trackdisk routines with VPOS.

No VPOS-based waiting should be needed. In my experience NOPs are absolutely sufficient. Reading/writing CIA registers automatically slows down the CPU to the Chip-RAM bus speed. I'm doing BCLR-NOP-BSET for such a step-pulse. In my tests it worked with all hardware up to the fastest 060 machines.

I never really cared. Loading from disk is slow anyway, and optimizations don't make much difference. The optimal case would be to load a new track for every rotation and decode the last track in the meantime, while waiting for the Disk-DMA to finish.

I just start reading a track, wait for DMA to finish, then decode the blocks I need.

Excellent feedback, thank you! For the NOP/VHPOSR question, I'm not doing a wait for a certain raster line, rather I'm just reading a custom register once (but not doing any looping), but that doesn't matter as you answered my question fully
