fast HAM8 conversion ? - Page 5

Thorham · 29 November 2007, 15:51

To meynaf:

- I saw both moveq.b and moveq.l - as moveq has no size, giving it one could be misleading
1. Moveq has no size? Man, I didn't know that

- move.b (a6),d0 followed by move.b d0,(a1)+ can be replaced by move.b (a6),(a1)+
2. Forgot to include it

- is move.l (sp),a5 really faster than move.l #adr,a5 ?
3. It seems to be. I thought: let's try it, and the number of frames dropped by one. I know the bench program has variable output, but the number of frames can go as low as 138, while without the address on the stack it doesn't want to go lower than 139.

Anyway, it's still faster, though, and there may be more of these smaller optimizations. Hope you approve of the ham8 table becoming 16 times as large...

To Kalms:

Wow, 85% huh? That's awful. I suspected some penalties, but nothing this serious. I've tried it and the speed for 1280x1024 24bit input (scaling factor 33%x50%) with the screen on is 99 frames. With the screen off it's 72 frames! That's optimizing.

Thanks, mate, greatly appreciated

To everyone:

Does anyone know the bandwidth ratings for chip mem in different modes?

meynaf · 29 November 2007, 16:40

Quote:

Originally Posted by Thorham

To meynaf:

- I saw both moveq.b and moveq.l - as moveq has no size, giving it one could be misleading
1. Moveq has no size? Man, I didn't know that

Oh sorry. I didn't know you weren't the one who wrote moveq.b and moveq.l in the code.

Quote:

Originally Posted by Thorham

- move.b (a6),d0 followed by move.b d0,(a1)+ can be replaced by move.b (a6),(a1)+
2. Forgot to include it

You may also want to replace the table of indexes with a table of addresses in the palette, like I did in the version I posted here.

Quote:

Originally Posted by Thorham

- is move.l (sp),a5 really faster than move.l #adr,a5 ?
3. It seems to be. I thought: let's try it, and the number of frames dropped by one. I know the bench program has variable output, but the number of frames can go as low as 138, while without the address on the stack it doesn't want to go lower than 139.

I'll run my "execute this code 50,000,000 times and gimme the timing" program once more to learn more about this... I suspect that the situation would be reversed if the data cache was turned off.

Quote:

Originally Posted by Thorham

Anyway, it's still faster, though, and there may be more of these smaller optimizations.

I dunno what the removal of that lsr instruction could gain, but one frame is 0.02 seconds. Not worth using 64k (well, ok, 60k) more mem, so I didn't remove the lsr instruction in my version.

Speaking about small opts, what do you think of :

Code:

move.l #addr,a5
move.l (a5,d4.l*4),a6

vs :

Code:

move.l (addr,d4.l*4),a6
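
Both variants read the longword from the same effective address; they differ only in how the table base reaches the address calculation. A minimal Python sketch of that arithmetic, where the function name and the sample values are made up for illustration and say nothing about which form is faster:

```python
def indexed_ea(base, index, scale=4):
    """Address used by (base,Dn.l*scale): base plus the scaled index, 32 bits wide."""
    return (base + index * scale) & 0xFFFFFFFF

table_base = 0x00200000   # hypothetical address of the table
d4 = 3                    # hypothetical index

# Two-instruction form: the base is first loaded into a5, then (a5,d4.l*4).
a5 = table_base
two_step = indexed_ea(a5, d4)

# One-instruction form: the base is encoded in the instruction, (addr,d4.l*4).
one_step = indexed_ea(table_base, d4)

print(hex(two_step), hex(one_step))
```

Since the address is identical, any speed difference comes from instruction length and decoding, which is what the timing program has to measure.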

Quote:

Originally Posted by Thorham

Hope you approve of the ham8 table becoming 16 times as large...

No.

Quote:

Originally Posted by Thorham

To Kalms:

Wow, 85% huh? That's awful. I suspected some penalties, but nothing this serious. I've tried it and the speed for 1280x1024 24bit input (scaling factor 33%x50%) with the screen on is 99 frames. With the screen off it's 72 frames! That's optimizing.

Thanks, mate, greatly appreciated

To everyone:

Does anyone know the bandwidth ratings for chip mem in different modes?

SHRES mode with 8 bpp uses 100% chipmem bandwidth as long as something is displayed, so you end up with 85% (but just imagine what it can give with maximum overscan)

I don't know the exact values for each mode, but you can guess what it can be by simply looking at the screen size in bytes.

Thorham · 29 November 2007, 19:46

Quote:

Originally Posted by meynaf

Oh sorry. I didn't know you weren't the one who wrote moveq.b and moveq.l in the code.

I was! What I meant was that I didn't know moveq has a fixed size. Really, I'm the one who wrote that code

It's a bit odd that assemblers actually accept non-existing opcodes

Quote:

Originally Posted by meynaf

You may also want to replace the table of indexes with a table of addresses in the palette, like I did in the version I posted here.

That's a good idea, didn't spot it in your new code!

Quote:

Originally Posted by meynaf

I'll run my "execute this code 50,000,000 times and gimme the timing" program once more to learn more about this... I suspect that the situation would be reversed if the data cache was turned off.

That's a handy program to have. I'm writing my own version straightaway. Not having to look up the cycle times for instructions is pretty handy.

Quote:

Originally Posted by meynaf

I dunno what the removal of that lsr instruction could gain, but one frame is 0.02 seconds. Not worth using 64k (well, ok, 60k) more mem, so I didn't remove the lsr instruction in my version.

By itself not much, but in combination with the other optimizations, about seven more frames have been chopped off, for a total of about 16 frames fewer than the original.
Also, I never think twice about sacrificing some memory to speed things up

Quote:

Originally Posted by meynaf

Speaking about small opts, what do you think of :

move.l #addr,a5
move.l (a5,d4.l*4),a6

vs :

move.l (addr,d4.l*4),a6

I didn't know that addressing mode, either

Looks like the single instruction is going to be faster. When I'm done with my own version of your 'run this code 50.000.000 times and give me the timing' program, I'll check it out.

Quote:

Originally Posted by meynaf

No.

Point taken

But I will keep it in my version

Quote:

Originally Posted by meynaf

SHRES mode with 8 bpp uses 100% chipmem bandwidth as long as something is displayed, so you end up with 85% (but just imagine what it can give with maximum overscan)

I don't know the exact values for each mode, but you can guess what it can be by simply looking at the screen size in bytes.

This has given me an idea for another speed test program, which you can use to determine the base bandwidth (bitplane dma off) and the bandwidth for selectable modes with the dma on. Maybe I'm going to make a list for all the 15khz modes, and post it.

Kalms · 29 November 2007, 20:18

Regarding bitplane DMA fetching:

Assume AGA hardware, and FMODE=$f.

The hardware fetches bitplane data in the following order:

1 5 2 6 3 7 4 8
or
1 5 3 7 2 6 4 8
(it doesn't matter which one, but it's one of the two)

One fetch period takes 8 chipbus cycles. Fetches for planes 1-4 happen during cycles which the CPU doesn't have access to; fetches for planes 5-8 happen during the CPU's cycles.

So, 1..4bpl do not steal any buscycles from the CPU, it's only planes 5-8 which do that.

In LORES, the fetch period is run once every 32 cycles.
In HIRES, the fetch period is run once every 16 cycles.
In SHRES, the fetch period is run once every 8 cycles (that is, continuously).

For a 320 pixels wide LORES scanline, the bitplane DMA needs to do 5 fetch periods (64 pixels per fetch period). HIRES, twice as many pixels -> 10 fetch periods. SHRES -> 20 fetch periods.

--------------------

One non-interlace frame is 228*313 = 71364 buscycles. Half of those are available to the CPU, so that means 35682 buscycles to the CPU.

Each 8bpl fetch period will steal 4 buscycles from the CPU. A 1280x256 display will thus steal 4*20*256 = 20480 buscycles.

Under ideal conditions, that leaves the CPU with 35682 - 20480 = 15202 buscycles, about 42% of its usual share. However, other DMA activity eats some more of the bandwidth... and the accelerator cards' interface to chipmem doesn't manage to sustain full throughput.
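
The budget above can be redone for other display sizes with a few lines of Python. This is only a sketch of the arithmetic in this post: the function name is made up, the constants are the figures quoted here, and it ignores copper, sprite and audio DMA as well as accelerator card overhead.

```python
BUS_CYCLES_PER_LINE = 228   # figures quoted in the post
LINES_PER_FRAME = 313
PIXELS_PER_FETCH = 64       # one fetch period covers 64 pixels with FMODE=$f

def cpu_cycles_left(width_pixels, height_lines, planes):
    """CPU bus cycles left per frame when only bitplane DMA is active."""
    cpu_share = BUS_CYCLES_PER_LINE * LINES_PER_FRAME // 2
    fetch_periods = width_pixels // PIXELS_PER_FETCH
    stolen_per_period = max(0, planes - 4)   # only planes 5-8 cost the CPU
    stolen = stolen_per_period * fetch_periods * height_lines
    return cpu_share - stolen

left = cpu_cycles_left(1280, 256, 8)
share = 100 * left / cpu_cycles_left(1280, 256, 0)
print(left, round(share, 1))   # 15202 cycles, 42.6 percent of the CPU's share
```

With four planes or fewer the function returns the full 35682 cycles, matching the statement that planes 1-4 do not steal anything from the CPU.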

If you time, remember to make sure that everything else (including copper, sprite and audio DMA) is turned off. And remember that figures will vary between different accelerator boards.

Kalms · 29 November 2007, 20:22

Speaking of instruction-level optimizations -- if you are timing on 68030, it is important that your loops fit within the 256-byte instruction cache. If your loop is X bytes, where X > 256, then the processor will have to read 2*(X-256) bytes extra from fastmem for each iteration through the loop.

If you are very close to the 256-byte limit, you may need to 16-byte-align the code at runtime: the ultimate criterion is that the loop must fit within the 16 available cachelines, and cachelines are 16-byte-aligned on the 020/030. [A CNOP 0,16 will not do because the AmigaOS hunk loader/memory allocator will only 8-byte align hunks during loading.]

Some optimizations (such as "move.l #imm32,dn" vs "move.l (an),dn") trade data memory access for code size; that might be the reason why you're sometimes getting counterintuitive results.

meynaf · 30 November 2007, 10:36

Quote:

Originally Posted by Thorham

I was! What I meant was that I didn't know moveq has a fixed size. Really, I'm the one who wrote that code

It's a bit odd that assemblers actually accept non-existing opcodes

Unless it has been corrected, phxass also accepts things such as nop d0

The problem of moveq is that it is half .b and half .l : it takes a byte and converts it to a long... So accepting both .b and .l is not that stupid

But, as I stated, it's misleading : moveq.b #n,dn is NOT equivalent to move.b #n,dn.
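
To make the difference concrete, here is a small Python model of what the two instructions do to a 32-bit data register. The function names are made up for illustration; the behaviour modelled is the documented one (moveq sign-extends its 8-bit immediate to a long, move.b touches only the low byte).

```python
def moveq(imm8):
    """moveq #imm,dn: sign-extend the 8-bit immediate; all 32 bits of dn are replaced."""
    imm8 &= 0xFF
    return (imm8 | 0xFFFFFF00) if (imm8 & 0x80) else imm8

def move_b(dn, imm8):
    """move.b #imm,dn: only the low byte of dn changes, the upper 24 bits are kept."""
    return (dn & 0xFFFFFF00) | (imm8 & 0xFF)

d0 = 0x12345678
print(hex(moveq(5)))        # 0x5
print(hex(move_b(d0, 5)))   # 0x12345605
print(hex(moveq(-1)))       # 0xffffffff
print(hex(move_b(d0, -1)))  # 0x123456ff
```

So writing moveq.b suggests a byte operation that leaves the rest of the register alone, which is exactly what does not happen.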

Quote:

Originally Posted by Thorham

That's a good idea, didn't spot it in your new code!

It multiplies the color table by 4, so I ended up with 16k. I considered it as a good deal

Quote:

Originally Posted by Thorham

That's a handy program to have. I'm writing my own version straightaway. Not having to look up the cycle times for instructions is pretty handy.

The advantage of it is that you can not only test individual instructions, but groups of them. And because of the pipeline (and superscalar on 060) it can change a lot by simply reordering them !

Quote:

Originally Posted by Thorham

By itself not much, but in combination with the other optimizations, about seven more frames have been chopped off, for a total of about 16 frames fewer than the original.
Also, I never think twice about sacrificing some memory to speed things up

I personally won't go above a few kb for just a little gain. But I started to code on 8-bit computers with 64 kb of ram, so...

Quote:

Originally Posted by Thorham

I didn't know that addressing mode, either

Looks like the single instruction is going to be faster. When I'm done with my own version of your 'run this code 50.000.000 times and give me the timing' program, I'll check it out.

You've got the habit of plain 68000, don't you ?

On 020+ you can even do that :
move.l (d0.l),d1
but it will be slower than :
move.l d0,a0
move.l (a0),d1

That's why I asked the question. Sometimes several instructions are faster than just one. I'll have the answer tomorrow...

Quote:

Originally Posted by Thorham

Point taken

But I will keep it in my version

Even if it makes you use 256kb for just one or two frames ???

Quote:

Originally Posted by Thorham

This has given me an idea for another speed test program, which you can use to determine the base bandwidth (bitplane dma off) and the bandwidth for selectable modes with the dma on. Maybe I'm going to make a list for all the 15khz modes, and post it.

Even with all dma turned off, chipmem isn't fast

To Kalms : the loop is variable, depending on the chosen pixel type, but it is 242 bytes at most in my current version. This should fit ?

Thorham · 30 November 2007, 10:37

Kalms, your dma fetch reply is most enlightening

The timing program I'm going to write is going to run with no dma on only for the base bandwidth calculation. For actual screen modes it's not per se, as I want to be able to control which dma is on or off (to emulate real life 'hit the hardware' situations). I do have to admit that being able to time with just the bpl dma on would be handy, so it's going to be included (although I always use the copper to handle screens, which I know you don't have to).

Indeed, loops that fit inside the i-cache completely are much faster, but I didn't know about the 16 byte alignment, so I'm definitely going to take that into account for everything I'm writing from now on. Thanks again!

Thorham · 30 November 2007, 11:10

Quote:

Originally Posted by meynaf

Unless it has been corrected, phxass also accepts things such as nop d0

The problem of moveq is that it is half .b and half .l : it takes a byte and converts it to a long... So accepting both .b and .l is not that stupid

But, as I stated, it's misleading : moveq.b #n,dn is NOT equivalent to move.b #n,dn.

It is misleading, and I'm not ever writing it again!

Quote:

Originally Posted by meynaf

It multiplies the color table by 4, so I ended up with 16k. I considered it as a good deal

Yes, it's a good deal. In fact, it would be silly not to do it that way.

Quote:

Originally Posted by meynaf

The advantage of it is that you can not only test individual instructions, but groups of them. And because of the pipeline (and superscalar on 060) it can change a lot by simply reordering them !

That is a major advantage, and something that becomes a nuisance with just individual opcode timing tables. With those, there are also other factors to take into account, and with a timing program you can do whatever you want. A good idea indeed.

Quote:

Originally Posted by meynaf

I personally won't go above a few kb for just a little gain. But I started to code on 8-bit computers with 64 kb of ram, so...

So did I. I started assembler programming with a Final Cartridge III (if I remember the name correctly) on the Commodore 64. Since my amiga has 16mb of fast mem, I usually don't care about some memory overhead (unless it becomes whole megabytes), and I'm more concerned with speed.

Quote:

Originally Posted by meynaf

You've got the habit of plain 68000, don't you ?

On 020+ you can even do that :
move.l (d0.l),d1
but it will be slower than :
move.l d0,a0
move.l (a0),d1

That's why I asked the question. Sometimes several instructions are faster than just one. I'll have the answer tomorrow...

Yup, you got that right

Started miggy coding on the good old a500 (where memory was tight, too). I've actually never gotten out of the habit of only using 68000 instructions.

About several instructions being faster than one: It only occurred to me for multiplications, never for anything else. Since the stuff I write usually isn't suitable for a plain 68000 machine anyway, I'm going to have to do some serious brushing-up of my 680x0 knowledge soon...

Quote:

Originally Posted by meynaf

Even if it makes you use 256kb for just one or two frames ???

Yes, even then. But only if this happens once or twice. Ending up with whole megabytes of overhead just for some small speed gains is unacceptable to me, too. In this case, it happens just once, making it acceptable to me.

Quote:

Originally Posted by meynaf

Even with all dma turned off, chipmem isn't fast

You can say that again! It's a big shame they didn't make it any faster. On the other hand, it does force you to be more creative than on the pc, where just about everything is fast. It's all that speed which takes the fun out of coding. No matter how practical pc's are, they are just plain boring

meynaf · 30 November 2007, 11:53

Quote:

Originally Posted by Thorham

Since my amiga has 16mb of fast mem, I usually don't care about some memory overhead (unless it becomes whole megabytes), and I'm more concerned with speed.

Anyway if it becomes whole megabytes you lose the gain : those megabytes need to be filled/read before used... and this takes time !
But I also want my viewer to be usable on a plain A1200 without fastmem, so I took care of the memory. My viewer usually reads the whole image in memory, but you can give it a maxmem argument to reduce the amount.

Otherwise, on my 32mb a1200 I am very rarely out of memory

Quote:

Originally Posted by Thorham

About several instructions being faster than one: It only occurred to me for multiplications, never for anything else.

Just an example :
add.l d0,a0
move.l (a0),d1
is actually faster than
move.l (a0,d0.l),d1
(probably not true on 040 though, needs to be checked)

For a single instruction that's rare, but small code isn't always faster than big code.

Quote:

Originally Posted by Thorham

Since the stuff I write usually isn't suitable for a plain 68000 machine anyway, I'm going to have to do some serious brushing-up of my 680x0 knowledge soon...

The additions on the 68020 were good, but 040/060 add nothing useful at all imho ('xcept speed of course).
The one I prefer isn't a new instruction/addressing mode, but just the ability to access words/longs at unaligned addresses.

Quote:

Originally Posted by Thorham

No matter how practical pc's are, they are just plain boring

+1

Thorham · 30 November 2007, 12:40

Quote:

Originally Posted by meynaf

Anyway if it becomes whole megabytes you lose the gain : those megabytes need to be filled/read before used... and this takes time !
But I also want my viewer to be usable on a plain A1200 without fastmem, so I took care of the memory. My viewer usually reads the whole image in memory, but you can give it a maxmem argument to reduce the amount.

Otherwise, on my 32mb a1200 I am very rarely out of memory

Exactly my thought. I ran into this problem when writing a fast tokenizer using a hash table. Now that hash table was fine (256kb, and never gets read completely, nor does it get filled up completely), but the code also generated megabytes of string table stuff, just so that the code would fit in the cache. Needless to say I should have written that completely differently.

Getting code to work properly on a plain A1200 is cool, and for those machines it might still be a good idea to only read a small part of the image and render it, then read some more. Just a thought.

Running out of non-chip memory on amigas is quite a hard thing to do with 32mb. I have 'only' 16mb, and have only used vmm once or twice!

Quote:

Originally Posted by meynaf

Just an example :
add.l d0,a0
move.l (a0),d1
is actually faster than
move.l (a0,d0.l),d1
(probably not true on 040 though, needs to be checked)

For a single instruction that's rare, but small code isn't always faster than big code.

I have something like that in the scaling loop of my example bmp viewer (which I'm going to make into a useful, all-round bmp viewer). Would it still be faster if you added a sub.l d0,a0 after it? It's needed because I've run out of registers...

Quote:

Originally Posted by meynaf

The additions on the 68020 were good, but 040/060 add nothing useful at all imho ('xcept speed of course).
The one I prefer isn't a new instruction/addressing mode, but just the ability to access words/longs at unaligned addresses.

Yeah, that is a good addition, because on 68000 word/long access had to be word aligned, not exactly convenient.

Quote:

Originally Posted by meynaf

+1

This speaks for itself

meynaf · 30 November 2007, 13:45

Quote:

Originally Posted by Thorham

Exactly my thought. I ran into this problem when writing a fast tokenizer using a hash table. Now that hash table was fine (256kb, and never gets read completely, nor does it get filled up completely), but the code also generated megabytes of string table stuff, just so that the code would fit in the cache. Needless to say I should have written that completely differently.

If you discover that the code doesn't fit in the caches, and if you have slow instructions in it (like mul/div, but no memory accesses), then you can put them in the beginning of your loop. This way, during the computation the caches are refilled.
Just my 2 cents...

Quote:

Originally Posted by Thorham

Getting code to work properly on a plain A1200 is cool, and for those machines it might still be a good idea to only read a small part of the image and render it, then read some more. Just a thought.

That's exactly what I did with the "maxmem" argument in my viewer. The buffer is read from the file, and the image is decoded up to the point the decoder runs out of data. Then more data is read.
You may have a look at the "input" routine in v.s to check how it's done.

Quote:

Originally Posted by Thorham

Running out of non-chip memory on amigas is quite a hard thing to do with 32mb. I have 'only' 16mb, and have only used vmm once or twice!

With 16mb you can do everything, however web browsers can easily eat them.
And if, like I do, you make an extensive use of the ram disk, then even 32mb can be quickly filled

Quote:

Originally Posted by Thorham

I have something like that in the scaling loop of my example bmp viewer (which I'm going to make into a useful, all-round bmp viewer). Would it still be faster if you added a sub.l d0,a0 after it? It's needed because I've run out of registers...

It may still be faster, but this needs to be checked...

If you make a full bmp viewer then I'm interested. This would make a neat addition to my actual viewer.

Quote:

Originally Posted by Thorham

Yeah, that is a good addition, because on 68000 word/long access had to be word aligned, not exactly convenient.

Sure. You just shouldn't abuse it, because those accesses are slower than aligned ones (but they are still faster than several instructions doing the same thing).

On 020+ I also like the scale factor in indexed modes and the long versions of mul & div.

Thorham · 30 November 2007, 18:20

Quote:

Originally Posted by meynaf

If you discover that the code doesn't fit in the caches, and if you have slow instructions in it (like mul/div, but no memory accesses), then you can put them in the beginning of your loop. This way, during the computation the caches are refilled.
Just my 2 cents...

And if you don't have expensive instructions, it's better to forget about the whole i-cache and just rely on the algorithm's speed. Sometimes it just can't be helped. This was the case with the tokenizer, which I actually haven't re-written (for an experimental compiler, which I'm now just doing in C)

Quote:

Originally Posted by meynaf

That's exactly what I did with the "maxmem" argument in my viewer. The buffer is read from the file, and the image is decoded up to the point the decoder runs out of data. Then more data is read.
You may have a look at the "input" routine in v.s to check how it's done.

Mis-interpreted that one, I was thinking something completely different

Quote:

Originally Posted by meynaf

With 16mb you can do everything, however web browsers can easily eat them.
And if, like I do, you make an extensive use of the ram disk, then even 32mb can be quickly filled

Browsers do use a lot of memory, so I do my browsing with the peecee, it's just more comfortable. Also, the amiga os browsers I have seen (in the past) aren't up to modern standards; I wonder if they are today. Wouldn't actually matter, as I don't have a network on my amiga right now (and the serial/parallel ports are a no go area, as they are a million times too slow).

Quote:

Originally Posted by meynaf

It may still be faster, but this needs to be checked...

If you make a full bmp viewer then I'm interested. This would make a neat addition to my actual viewer.

Well, I'm thinking about it. After seeing how fast and easy my current code makes viewing large images, I really should. It would actually be pretty crazy not to. I only have to get to grips with intuition coding, so it's not even an unrealistic thought.

Quote:

Originally Posted by meynaf

Sure. You just shouldn't abuse it, because those accesses are slower than aligned ones (but they are still faster than several instructions doing the same thing).

On 020+ I also like the scale factor in indexed modes and the long versions of mul & div.

The scale factors are handy, and I think the scaling is free of charge! While the 32bit mul/div opcodes are nice to have, they are slower than their 16bit equivalents. Sometimes they are useful, plain and simple. You know what it's like to have to make do without them.

Anyway, I hope you like the example bmp viewer, but I'm sure you'll agree it can be optimized, and written better in places... And of course, it's all in plain 68000

Kalms · 01 December 2007, 11:12

meynaf: 242 bytes loop should fit, yes.

The best-case scenario is when the loop-start is at offset 0 (modulo 16); then the last byte of the loop code lies at $f1.

The worst-case scenario is when the loop-start is at offset $e (modulo 16); then the last byte of loop code lies at $ff.

However, I don't know whether the 030 CPU will attempt to prefetch the instruction following a branch before the branch condition has been evaluated. (I don't think it will, but I'm not 100% sure.) If it does that, then you need to ensure that the instruction following the branch also fits into the cache.

AmigaOS's memory allocator always returns 8-byte aligned blocks, so CNOP 0,8 will successfully 8-byte align code/data in your program. By 8-byte-aligning the loopstart, you allow yourself to use loops that are up to 248 bytes long.
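
The placement cases above can be checked by counting how many 16-byte cachelines a loop touches. A minimal Python sketch, assuming the 68030 instruction cache layout described in this thread (16 lines of 16 bytes) and ignoring the open prefetch question; the function names are made up for illustration.

```python
LINE_BYTES = 16
CACHE_LINES = 16   # 68030 instruction cache: 16 lines of 16 bytes = 256 bytes

def lines_spanned(start, length):
    """Number of 16-byte-aligned cachelines touched by `length` bytes at `start`."""
    first = start // LINE_BYTES
    last = (start + length - 1) // LINE_BYTES
    return last - first + 1

def loop_fits(start, length):
    """True if the whole loop can sit in the instruction cache at once."""
    return lines_spanned(start, length) <= CACHE_LINES

for start in (0x0, 0x8, 0xE):
    print(hex(start), loop_fits(start, 242), loop_fits(start, 248))
```

A 242-byte loop fits at any even start offset, while a 248-byte loop needs the start to be at least 8-byte aligned.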

As for
add.l d0,a0
move.l (a0),d1
vs
move.l (a0,d0.l),d1

-- for 040+, the general rule is to not touch any of the registers involved in an EA computation during 2-3 cycles before the instruction that references memory. So both versions above are (roughly) equally fast; what matters more is whether you can move the updates for a0 (and d0 as well in the latter case) a few lines up in the code sequence.

Thorham · 01 December 2007, 13:06

Quote:

Originally Posted by Kalms

AmigaOS's memory allocator always returns 8-byte aligned blocks, so CNOP 0,8 will successfully 8-byte align code/data in your program. By 8-byte-aligning the loopstart, you allow yourself to use loops that are up to 248 bytes long.

You could just have your own code handle the 16 byte alignment by simply copying the loop to a 16 byte aligned address. May require some fiddling, but should still work in a lot of cases. That way, your loop can be 256 bytes long.
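
The address arithmetic for that is the usual round-up-and-mask. A minimal Python sketch, where the function name and the sample block address are made up for illustration:

```python
def align_up(address, alignment=16):
    """Round address up to the next multiple of alignment (a power of two)."""
    return (address + alignment - 1) & ~(alignment - 1)

loop_length = 256
block = 0x0004A7F8                      # hypothetical 8-byte-aligned allocation
block_size = loop_length + 15           # slack so an aligned start always exists
target = align_up(block)                # where the loop would be copied to
print(hex(target), target + loop_length <= block + block_size)
```

On the real machine the copied loop would also have to be position-independent, and the caches would need flushing after the copy, which is part of the fiddling mentioned above.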

Kalms · 01 December 2007, 14:08

Indeed.

Thorham · 01 December 2007, 14:59

But what about the data cache, i-burst, and d-burst ???

Kalms · 01 December 2007, 16:29

Ok, I'll bite.

You usually want to have I-burst on. The only case where I-burst off might be sensible is when your execution path is made up of a lot of small fragments of code (like a binary search tree implemented in code) and the fragments themselves are noticeably smaller than a cacheline themselves, and/or not cacheline-aligned.

D-burst on/off is only a question for 68030. On 030, if D-burst is off, the CPU will only fetch the requested longword(s); if D-burst is on, the entire cacheline will be loaded.
On 040+ (with the default MMU configurations), the CPU will always fetch the entire cacheline, and D-burst controls whether it will be through four individual 32bit memory access cycles, or whether it will be through a 4-beat burst transfer. Burst transfer is never slower than 4 individual accesses.
Optimizing for 030 with D-burst on is similar to optimizing for 040+.

I used to run with D-burst on at all times on 030 because if tuned, algorithms would then usually run quicker than with D-burst off.

[The bits about the 68030 are from memory, so you had better validate them against the 68030UM before taking them as gospel]
When D-burst on 68030 is off, a cache read miss causes the following effect:
* A new cacheline is allocated. If the cacheline is already allocated to another 16-byte region, it is invalidated.
* The CPU stalls until the longword has been fetched from main memory. The longword will be stored in the allocated cacheline, along with a tag in the cacheline which indicates which of the longwords is valid.

When D-burst on 68030 is on, a cache read miss causes the following effect:
* A new cacheline is allocated. The contents of the entire cacheline is invalidated.
* The CPU stalls until the first longword has been fetched from main memory. After that, the CPU continues execution.
* The bus controller fetches 4 longwords and stores them into the cache. Once finished, the entire cacheline will contain valid data.
* If the CPU tries to perform any sort of memory access before the cache line has been fully loaded, it will stall until the cacheline has been filled.
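
The two read-miss behaviours can be compared with a toy model that only counts longwords fetched from memory. This is a rough Python sketch built on the description above (itself given from memory), assuming a 256-byte direct-mapped data cache with 16 lines of 4 longwords; it ignores timing, stalls, writes and misaligned accesses, and the function name is made up.

```python
LINE_BYTES = 16
LONG_BYTES = 4
NUM_LINES = 16   # assumed: 256-byte direct-mapped data cache

def longwords_fetched(addresses, burst):
    """Count longwords read from memory for a series of aligned longword reads."""
    tags = [None] * NUM_LINES
    valid = [[False] * 4 for _ in range(NUM_LINES)]
    fetched = 0
    for addr in addresses:
        line_addr = addr // LINE_BYTES
        index = line_addr % NUM_LINES
        slot = (addr % LINE_BYTES) // LONG_BYTES
        if tags[index] != line_addr:        # line held another region: invalidate
            tags[index] = line_addr
            valid[index] = [False] * 4
        if not valid[index][slot]:          # read miss
            if burst:
                valid[index] = [True] * 4   # whole cacheline is loaded
                fetched += 4
            else:
                valid[index][slot] = True   # only the requested longword
                fetched += 1
    return fetched

sequential = list(range(0, 64, 4))    # 16 consecutive longwords
strided = list(range(0, 256, 16))     # one longword from each of 16 lines
print(longwords_fetched(sequential, False), longwords_fetched(sequential, True))
print(longwords_fetched(strided, False), longwords_fetched(strided, True))
```

In this model sequential reads cost the same number of fetches either way, so burst only wins through the cheaper transfer, while scattered reads fetch four times as much data with burst on.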

On 68030, writes will always generate memory accesses (the data cache is "Write-through"). If you read and write to the same place in memory, the writes might invalidate cache contents for matching cache lines. See 68030UM for details. In short, don't read and write to the same place.

On 68040+, a cache read miss causes the following effect:
* A new cacheline is allocated (and its contents is invalidated).
* The bus controller fetches 4 longwords and stores them into the cache. Once finished, the entire cacheline will contain valid data.
* The CPU stalls until the first longword has been fetched from main memory. After that, the CPU continues execution.
* If the CPU tries to access external memory, or the cacheline that is being loaded, before the cache line has been fully loaded, it will stall until the cacheline has been filled. The CPU can however access other cachelines while the current one is being filled.

68040+ can also cache writes. A cache write miss causes the following effect:
* A new cacheline is allocated & invalidated.
* The bus controller reads the cacheline in.
* One longword in the cacheline is replaced by the data that the CPU tried to write.
I'm not sure how long the CPU is stalled during the write-miss process.

Misaligned accesses (i.e. reads/writes that straddle longword boundaries) make the analysis a bit more complicated.

Thorham · 01 December 2007, 17:52

Wow, Kalms, thanks

This stuff is very interesting, and I'm going to have a very thorough look at it. Taking all this into account when programming seems a little hard, but maybe not when you're used to it. Since I'm so used to plain 68000, this would be a great addition to a bunch of new and improved coding habits!

And I was kinda hoping you'd bite

Thanks again

meynaf · 03 December 2007, 10:41

I see things have advanced this week-end...

Quote:

Originally Posted by Thorham

And if you don't have expensive instructions, it's better to forget about the whole i-cache and just rely on the algorithm's speed. Sometimes it just can't be helped. This was the case with the tokenizer, which I actually haven't re-written (for an experimental compiler, which I'm now just doing in C)

For a compiler, the tokenizer isn't really the time-critical part. I think the best practice is to simply make the code as small as possible.

Quote:

Originally Posted by Thorham

Mis-interpreted that one, I was thinking something completely different

What were you thinking then ?

Quote:

Originally Posted by Thorham

Browsers do use a lot of memory, so I do my browsing with the peecee, it's just more comfortable. Also, the amiga os browsers I have seen (in the past) aren't up to modern standards; I wonder if they are today. Wouldn't actually matter, as I don't have a network on my amiga right now (and the serial/parallel ports are a no go area, as they are a million times too slow).

Browsers on Amiga aren't up to the modern standards, so far not, although they're fast for what they do. But, no, I will not use my old 56k modem again if I can do otherwise

Quote:

Originally Posted by Thorham

Well, I'm thinking about it. After seeing how fast and easy my current code makes viewing large images, I really should. It would actually be pretty crazy not to. I only have to get to grips with intuition coding, so it's not even an unrealistic thought.

If it would actually be pretty crazy not to do something, then don't do it.

(sorry I just couldn't resist)

As for intuition coding, everything is done within my library, I'll nevermore call OpenScreenTagList/OpenWindowTagList directly...
You can look at it, or, simpler, just use it. Opening an intuition screen is bloody easy.

Quote:

Originally Posted by Thorham

The scale factors are handy, and I think the scaling is free of charge! While the 32bit mul/div opcodes are nice to have, they are slower than their 16bit equivalents. Sometimes they are useful, plain and simple. You know what it's like to have to make do without them.

But are the bitfield instructions handy ? (ok, bfffo saved me once, but...)
Are cas/cas2/chk2 really useful ?

Quote:

Originally Posted by Thorham

Anyway, I hope you like the example bmp viewer, but I'm sure you'll agree it can be optimized, and written better in places... And of course, it's all in plain 68000

I'll do mine if I've got enough time. Making a bmp.s for my viewer is actually pretty easy : it needs only to check if the file format is a bmp, get the image dimensions, and feed the data "as is" to the rendering engine...

Quote:

Originally Posted by Thorham

You could just have your own code handle the 16 byte alignment by simply copying the loop to a 16 byte aligned address. May require some fiddling, but should still work in a lot of cases. That way, your loop can be 256 bytes long.

I would personally never do that. Looks like bad practice to me. Evil.

To Kalms : feel free to bite whenever you want.

You got it right with the burst stuff.
When it's inactive, we fetch 4 bytes at a time.
When it's active, we fetch 16 bytes at a time. That's slower than one longword access, but faster than four separate longword accesses.
So it's good for consecutive memory accesses, e.g. for code or for a copymem. But I personally leave it disabled for data, as it's better for everyday use.

Thorham · 03 December 2007, 11:38

Quote:

Originally Posted by meynaf

What were you thinking then ?

Something very silly, indeed. I thought it was just a mem use limiter; it didn't occur to me that it still read everything (told you it was a silly thought).

Quote:

Originally Posted by meynaf

As for intuition coding, everything is done within my library, I'll nevermore call OpenScreenTagList/OpenWindowTagList directly...
You can look at it, or, simpler, just use it. Opening an intuition screen is bloody easy.

Thanks

I might just do that. First I want to know exactly how opening screens works, because, if at all possible, I want to understand the code I use. That's why my c2p routine is unoptimized: I don't completely understand all the optimizations.

Quote:

Originally Posted by meynaf

But are the bitfield instructions handy ? (ok, bfffo saved me once, but...)
Are cas/cas2/chk2 really useful ?

Going to have to look these up

Quote:

Originally Posted by meynaf

I'll do mine if I've got enough time. Making a bmp.s for my viewer is actually pretty easy : it needs only to check if the file format is a bmp, get the image dimensions, and feed the data "as is" to the rendering engine...

That's interesting. It would add functionality to your viewer, and save me time doing one from scratch (more or less). I'm going to see if I can write a proper bmp.s for your source code.

Quote:

Originally Posted by meynaf

I would personally never do that. Looks like bad practice to me. Evil.

Actually it's not. Code which modifies itself during runtime is evil. Copying code around at the init stage of a program is not, as long as the code itself stays the same. It's just like runtime code generation where you generate a jump table each time the running program gets new data. AmigaOS actually does its code relocating by modifying the code itself. Since it happens before the code is executed, this is fine.
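The address arithmetic behind "copy the loop to a 16-byte aligned address" can be sketched as below. This is only an illustration in Python, not code from the thread: `align_up` and `place_loop` are hypothetical helper names, and the block address is made up. The idea is to over-allocate by 15 bytes so that an aligned start always exists inside the block.

```python
def align_up(address, alignment=16):
    """Round address up to the next multiple of alignment (a power of two)."""
    return (address + alignment - 1) & ~(alignment - 1)


def place_loop(block_start, loop_bytes, alignment=16):
    """Return (aligned_start, allocation_size) for a copied loop.

    Hypothetical helper: the caller is assumed to have allocated
    allocation_size bytes starting at block_start.
    """
    allocation_size = loop_bytes + alignment - 1   # worst-case padding
    aligned_start = align_up(block_start, alignment)
    # The copied loop must still end inside the allocated block.
    assert aligned_start + loop_bytes <= block_start + allocation_size
    return aligned_start, allocation_size


# Example with a made-up, 8-byte aligned block address.
start, size = place_loop(0x12348, 256)
print(hex(start), size)
```

On a real Amiga the caches would also have to be flushed after the copy (exec.library offers CacheClearU from V37 on), which this sketch does not model.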

Sorry about the archive I've uploaded being rather big. The three bmps I've included are rather large (in particular the 1280x1024 one: almost 4 MB). And I just couldn't resist adding some extras: IrfanView's ability to display iff files properly is just all too handy, and it contains a very flexible batch conversion feature, so I included it.

Kalms · 01 December 2007, 11:12

meynaf: a 242-byte loop should fit, yes. The best-case scenario is when the loop start is at offset 0 (modulo 16); then the last byte of the loop code lies at $f1. The worst-case scenario is when the loop start is at offset $e (modulo 16); then the last byte of loop code lies at $ff.

However, I don't know whether the 030 CPU will attempt to prefetch the instruction following a branch before the branch condition has been evaluated. (I don't think it will, but I'm not 100% sure.) If it does that, then you need to ensure that the instruction following the branch also fits into the cache.

AmigaOS's memory allocator always returns 8-byte aligned blocks, so CNOP 0,8 will successfully 8-byte align code/data in your program. By 8-byte-aligning the loop start, you allow yourself to use loops that are up to 248 bytes long.

As for add.l d0,a0 / move.l (a0),d1 vs move.l (a0,d0.l),d1 -- for 040+, the general rule is to not touch any of the registers involved in an EA computation during the 2-3 cycles before the instruction that references memory. So both versions above are (roughly) equally fast; what matters more is whether you can move the updates for a0 (and d0 as well in the latter case) a few lines up in the code sequence.

Last edited by Kalms; 01 December 2007 at 11:17.
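As a rough check of the offsets discussed in the post above, the following Python sketch counts how many 16-byte cachelines a loop touches and derives the largest loop that fits for a given guaranteed alignment. It assumes only what the post states (a 256-byte cache made of 16 lines of 16 bytes); the function names are hypothetical.

```python
CACHE_BYTES = 256   # 68030 instruction cache size, as stated in the post
LINE_BYTES = 16     # cacheline size, as stated in the post


def lines_touched(start, length):
    """Number of cachelines covered by `length` bytes beginning at `start`."""
    first = start // LINE_BYTES
    last = (start + length - 1) // LINE_BYTES
    return last - first + 1


def fits(start, length):
    """True if the loop occupies no more lines than the cache has."""
    return lines_touched(start, length) <= CACHE_BYTES // LINE_BYTES


def max_safe_length(alignment):
    """Largest loop that fits for every start that is a multiple of `alignment`.

    The worst case is the last aligned offset inside a cacheline.
    """
    worst_offset = LINE_BYTES - alignment
    return CACHE_BYTES - worst_offset


for alignment in (2, 8, 16):
    print(alignment, max_safe_length(alignment))
```

With word alignment only (2), the safe limit comes out at 242 bytes; with 8-byte alignment at 248; with 16-byte alignment the full 256 bytes are usable.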

Kalms · 01 December 2007, 16:29

Ok, I'll bite.

You usually want to have I-burst on. The only case where I-burst off might be sensible is when your execution path is made up of a lot of small fragments of code (like a binary search tree implemented in code) and the fragments are noticeably smaller than a cacheline, and/or not cacheline-aligned.

D-burst on/off is only a question for 68030. On 030, if D-burst is off, the CPU will only fetch the requested longword(s); if D-burst is on, the entire cacheline will be loaded. On 040+ (with the default MMU configurations), the CPU will always fetch the entire cacheline, and D-burst controls whether it will be through four individual 32bit memory access cycles, or whether it will be through a 4-beat burst transfer.

Burst transfer is never slower than 4 individual accesses.

Optimizing for 030 with D-burst on is similar to optimizing for 040+. I used to run with D-burst on at all times on 030 because, if tuned, algorithms would then usually run quicker than with D-burst off.

[The bits about the 68030 are from memory, so you had better validate them against the 68030UM before taking them as gospel.]

When D-burst on 68030 is off, a cache read miss causes the following effect:

* A new cacheline is allocated. If the cacheline is already allocated to another 16-byte region, it is invalidated.
* The CPU stalls until the longword has been fetched from main memory. The longword will be stored in the allocated cacheline, along with a tag in the cacheline which indicates which of the longwords is valid.

When D-burst on 68030 is on, a cache read miss causes the following effect:

* A new cacheline is allocated. The contents of the entire cacheline are invalidated.
* The CPU stalls until the first longword has been fetched from main memory. After that, the CPU continues execution.
* The bus controller fetches 4 longwords and stores them into the cache. Once finished, the entire cacheline will contain valid data.
* If the CPU tries to perform any sort of memory access before the cacheline has been fully loaded, it will stall until the cacheline has been filled.

On 68030, writes will always generate memory accesses (the data cache is "write-through"). If you read and write to the same place in memory, the writes might invalidate cache contents for matching cachelines. See 68030UM for details. In short, don't read and write to the same place.

On 68040+, a cache read miss causes the following effect:

* A new cacheline is allocated (and its contents are invalidated).
* The bus controller fetches 4 longwords and stores them into the cache. Once finished, the entire cacheline will contain valid data.
* The CPU stalls until the first longword has been fetched from main memory. After that, the CPU continues execution.
* If the CPU tries to access external memory, or the cacheline that is being loaded, before the cacheline has been fully loaded, it will stall until the cacheline has been filled. The CPU can however access other cachelines while the current one is being filled.

68040+ can also cache writes. A cache write miss causes the following effect:

* A new cacheline is allocated & invalidated.
* The bus controller reads the cacheline in.
* One longword in the cacheline is replaced by the data that the CPU tried to write.

I'm not sure how long the CPU is stalled during the write-miss process.

Misaligned accesses (i.e. reads/writes that straddle longword boundaries) make the analysis a bit more complicated.

Last edited by Kalms; 01 December 2007 at 18:30. Reason: typos
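The 68030 read-miss rules listed above can be turned into a small counting model. The Python sketch below is a simplification under stated assumptions: a direct-mapped data cache of 16 lines with 4 longwords each and one valid bit per longword, longword-aligned reads only, and no timing or write behaviour. `simulate_reads` is a hypothetical name; the model counts bus traffic, not cycles.

```python
LINE_BYTES = 16      # bytes per cacheline
NUM_LINES = 16       # lines in the 256-byte data cache
LONG_BYTES = 4       # bytes per longword


def simulate_reads(addresses, d_burst):
    """Return (misses, longwords_fetched) for longword reads (simplified model)."""
    tags = [None] * NUM_LINES
    valid = [[False] * 4 for _ in range(NUM_LINES)]
    misses = 0
    fetched = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        index = line % NUM_LINES
        slot = (addr % LINE_BYTES) // LONG_BYTES
        if tags[index] == line and valid[index][slot]:
            continue                      # cache hit, no bus traffic
        misses += 1
        if tags[index] != line:
            tags[index] = line            # allocate the line for a new region
            valid[index] = [False] * 4    # old contents are invalidated
        if d_burst:
            valid[index] = [True] * 4     # the whole line is filled
            fetched += 4
        else:
            valid[index][slot] = True     # only the requested longword
            fetched += 1
    return misses, fetched


sequential = range(0, 64, 4)     # 16 consecutive longwords
strided = range(0, 256, 16)      # 16 longwords, one per cacheline

for name, pattern in (("sequential", sequential), ("strided", strided)):
    print(name, simulate_reads(pattern, False), simulate_reads(pattern, True))
```

In this model, sequential reads miss four times less often with D-burst on while fetching the same amount of data, whereas one-longword-per-line reads fetch four times as much data with D-burst on. That matches the remark elsewhere in the thread that burst suits consecutive accesses.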



Kalms · 29 November 2007, 20:18

Regarding bitplane DMA fetching:

Assume AGA hardware, and FMODE=$f. The hardware fetches bitplane data in the following order:

1 5 2 6 3 7 4 8
or
1 5 3 7 2 6 4 8
(it doesn't matter which one, but it's one of the two)

One fetch period takes 8 chipbus cycles. Fetches for planes 1-4 happen during cycles which the CPU doesn't have access to; fetches for planes 5-8 happen during the CPU's cycles. So, 1..4bpl do not steal any buscycles from the CPU, it's only planes 5-8 which do that.

In LORES, the fetch period is run once every 32 cycles.
In HIRES, the fetch period is run once every 16 cycles.
In SHRES, the fetch period is run once every 8 cycles (that is, continuously).

For a 320 pixels wide LORES scanline, the bitplane DMA needs to do 5 fetch periods (64 pixels per fetch period). HIRES, twice as many pixels -> 10 fetch periods. SHRES -> 20 fetch periods.

--------------------

One non-interlace frame is 228*313 = 71364 buscycles. Half of those are available to the CPU, so that means 35682 buscycles to the CPU.

Each 8bpl fetch period will steal 4 buscycles from the CPU. A 1280x256 display will thus steal 4*20*256 = 20480 buscycles.

Under ideal conditions, that leaves the CPU with 35682 - 20480 = 15202 buscycles, about 42% of the CPU's buscycles.

However, other DMA activity eats some more of the bandwidth... and the accelerator cards' interfaces to chipmem don't manage to sustain full throughput.

If you time, remember to make sure that everything else (including copper, sprite and audio DMA) is turned off. And remember that figures will vary between different accelerator boards.
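The frame budget in the post above is easy to re-derive. The Python sketch below uses only the figures quoted there; the helper name `cpu_budget` is hypothetical, and the generalisation to fewer than 8 bitplanes (one stolen cycle per fetch period for each plane beyond the fourth) is an assumption drawn from the fetch order described, not something the post states outright.

```python
CYCLES_PER_LINE = 228    # chip bus cycles per scanline (from the post)
LINES_PER_FRAME = 313    # scanlines per non-interlaced frame (from the post)

# Fetch periods per 320-pixel-equivalent scanline, as listed in the post.
FETCH_PERIODS = {"LORES": 5, "HIRES": 10, "SHRES": 20}


def cpu_budget(mode, display_lines, planes=8):
    """Return (cpu_cycles, stolen, left) in buscycles for one frame."""
    cpu_cycles = CYCLES_PER_LINE * LINES_PER_FRAME // 2   # CPU gets half
    # Planes 1-4 are fetched in slots the CPU cannot use anyway,
    # so only planes 5-8 cost the CPU anything (assumed: one cycle each).
    stolen_per_fetch = max(0, planes - 4)
    stolen = stolen_per_fetch * FETCH_PERIODS[mode] * display_lines
    return cpu_cycles, stolen, cpu_cycles - stolen


cpu_cycles, stolen, left = cpu_budget("SHRES", 256)
print(cpu_cycles, stolen, left)
print(f"fraction of CPU buscycles left: {left / cpu_cycles:.3f}")
```

For the 1280x256, 8-bitplane case this reproduces the 35682, 20480 and 15202 figures from the post; other modes follow by changing the arguments.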

Kalms · 29 November 2007, 20:22

Speaking of instruction-level optimizations -- if you are timing on 68030, it is important that your loops fit within the 256-byte instruction cache. If your loop is X bytes, where X > 256, then the processor will have to read 2*(X-256) bytes extra from fastmem for each iteration through the loop.

If you are very close to the 256-byte limit, you may need to 16-byte-align the code at runtime: the ultimate criterion is that the loop must fit within the 16 available cachelines, and cachelines are 16-byte-aligned on the 020/030. [A CNOP 0,16 will not do because the AmigaOS hunk loader/memory allocator will only 8-byte align hunks during loading.]

Some optimizations (such as "move.l #imm32,dn" vs "move.l (an),dn") trade data memory access for code size; that might be the reason why you're sometimes getting counterintuitive results.
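The 2*(X-256) figure follows from the head and tail of an oversized loop evicting each other on every pass. A small Python model can confirm it under simplifying assumptions: a direct-mapped cache of 16 lines of 16 bytes, whole lines refilled on a miss, straight-line execution through the loop body, and loops of at most 512 bytes. `steady_state_miss_bytes` is a hypothetical name.

```python
LINE_BYTES = 16
NUM_LINES = 16


def steady_state_miss_bytes(loop_bytes, start=0, warmup=2):
    """Bytes refetched per iteration once the cache has warmed up."""
    tags = [None] * NUM_LINES
    first = start // LINE_BYTES
    last = (start + loop_bytes - 1) // LINE_BYTES
    missed = 0
    for _ in range(warmup + 1):
        missed = 0                          # keep only the final pass
        for line in range(first, last + 1):
            index = line % NUM_LINES        # direct-mapped placement
            if tags[index] != line:
                tags[index] = line          # evict whatever was there
                missed += LINE_BYTES
    return missed


for size in (256, 272, 288, 512):
    print(size, steady_state_miss_bytes(size), 2 * (size - 256))
```

For line-aligned loops between 256 and 512 bytes the model and the formula agree; beyond 512 bytes every line misses on every pass, so the formula no longer applies.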

Thorham · 30 November 2007, 10:37

Kalms, your DMA fetch reply is most enlightening. The timing program I'm going to write will run with no DMA on only for the base bandwidth calculation. For actual screen modes that's not the case per se, as I want to be able to control which DMA is on or off (to emulate real-life 'hit the hardware' situations). I do have to admit that being able to time with just the bpl DMA on would be handy, so it's going to be included (although I always use the copper to handle screens, which I know you don't have to).

Indeed, loops that fit inside the i-cache completely are much faster, but I didn't know about the 16-byte alignment, so I'm definitely going to take that into account for everything I'm writing from now on. Thanks again!

Kalms · 01 December 2007, 14:08

Indeed.

Thorham · 01 December 2007, 14:59

But what about the data cache, i-burst, and d-burst?

Thorham · 01 December 2007, 17:52

Wow, Kalms, thanks. This stuff is very interesting, and I'm going to have a very thorough look at it. Taking all this into account when programming seems a little hard, but maybe not when you're used to it. Since I'm so used to plain 68000, this would be a great addition to a bunch of new and improved coding habits! And I was kinda hoping you'd bite. Thanks again.
