29 November 2007, 15:51 | #81 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,753
|
To meynaf:
- I saw both moveq.b and moveq.l - as moveq has no size, giving it one could be misleading
1. Moveq has no size? Man, I didn't know that.
- move.b (a6),d0 followed by move.b d0,(a1)+ can be replaced by move.b (a6),(a1)+
2. Forgot to include it.
- is move.l (sp),a5 really faster than move.l #adr,a5 ?
3. It seems to be. I thought: let's try it, and the number of frames dropped by one. I know the bench program has variable output, but the number of frames can go as low as 138, while without the address on the stack it doesn't want to go lower than 139. Anyway, it's still faster, and there may be more of these smaller optimizations. Hope you approve of the ham8 table becoming 16 times as large...
To Kalms: Wow, 85% huh? That's awful. I suspected some penalties, but nothing this serious. I've tried it and the speed for 1280x1024 24bit input (scaling factor 33%x50%) with the screen on is 99 frames. With the screen off it's 72 frames! That's optimizing. Thanks, mate, greatly appreciated.
To everyone: Does anyone know the bandwidth ratings for chip mem in different modes? |
29 November 2007, 16:40 | #82 | ||||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
Quote:
Quote:
Quote:
Speaking about small opts, what do you think of :
Code:
    move.l #addr,a5
    move.l (a5,d4.l*4),a6
Code:
    move.l (addr,d4.l*4),a6
Quote:
Quote:
I don't know the exact values for each mode, but you can guess what it can be by simply looking at the screen size in bytes. |
||||||
29 November 2007, 19:46 | #83 | |||||||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,753
|
Quote:
It's a bit odd that assemblers actually accept non-existing opcodes Quote:
Quote:
Quote:
Also, I never think twice about sacrificing some memory to speed things up Quote:
Quote:
Quote:
Last edited by Thorham; 29 November 2007 at 19:48. Reason: Correction |
|||||||
29 November 2007, 20:18 | #84 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
Regarding bitplane DMA fetching:
Assume AGA hardware, and FMODE=$f.
The hardware fetches bitplane data in the following order: 1 5 2 6 3 7 4 8 or 1 5 3 7 2 6 4 8 (it doesn't matter which one, but it's one of the two).
One fetch period takes 8 chipbus cycles. Fetches for planes 1-4 happen during cycles which the CPU doesn't have access to; fetches for planes 5-8 happen during the CPU's cycles. So 1..4bpl do not steal any buscycles from the CPU; only planes 5-8 do that.
In LORES, the fetch period is run once every 32 cycles.
In HIRES, the fetch period is run once every 16 cycles.
In SHRES, the fetch period is run once every 8 cycles (that is, continuously).
For a 320 pixels wide LORES scanline, the bitplane DMA needs to do 5 fetch periods (64 pixels per fetch period). HIRES, twice as many pixels -> 10 fetch periods. SHRES -> 20 fetch periods.
--------------------
One non-interlace frame is 228*313 = 71364 buscycles. Half of those are available to the CPU, so that means 35682 buscycles to the CPU.
Each 8bpl fetch period will steal 4 buscycles from the CPU. A 1280x256 display will thus steal 4*20*256 = 20480 buscycles.
Under ideal conditions, that leaves the CPU with 35682 - 20480 = 15202 buscycles, about 42% of the 35682 it would otherwise have.
However, other DMA activity eats some more of the bandwidth... and the accelerator cards' interface to chipmem doesn't manage to sustain full throughput.
If you time, remember to make sure that everything else (including copper, sprite and audio DMA) is turned off. And remember that figures will vary between different accelerator boards. |
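The arithmetic in the post above can be replayed in a few lines of Python. This is only a sketch under the post's stated assumptions (AGA, FMODE=$f, a standard-width display, no other DMA running). The function and table names are invented for illustration, and the rule "one stolen cycle per plane above 4" is a reading of the post, not something verified on hardware.

```python
# Sketch of the chip-bus budget described in the post above.
# All names are illustrative; the figures come straight from the post.

CYCLES_PER_LINE = 228      # bus cycles per scanline
LINES_PER_FRAME = 313      # scanlines in one non-interlaced frame

# Fetch periods per scanline for a display 320 lores pixels wide
# (640 hires, 1280 shres), as given in the post.
FETCH_PERIODS_PER_LINE = {"lores": 5, "hires": 10, "shres": 20}


def cpu_cycles_per_frame():
    """Half of all bus cycles in a frame are available to the CPU."""
    return CYCLES_PER_LINE * LINES_PER_FRAME // 2


def cycles_stolen(mode, planes, display_lines):
    """Bus cycles that bitplane DMA takes away from the CPU per frame.

    Assumption: planes 1-4 are free, each plane above 4 costs one
    CPU cycle per fetch period.
    """
    stolen_per_period = max(0, planes - 4)
    return stolen_per_period * FETCH_PERIODS_PER_LINE[mode] * display_lines


def cpu_cycles_left(mode, planes, display_lines):
    """CPU bus cycles remaining per frame under ideal conditions."""
    return cpu_cycles_per_frame() - cycles_stolen(mode, planes, display_lines)


if __name__ == "__main__":
    left = cpu_cycles_left("shres", 8, 256)
    print(left, "of", cpu_cycles_per_frame(), "CPU bus cycles left")
```

Calling `cpu_cycles_left("shres", 8, 256)` reproduces the 1280x256 case from the post; other modes and plane counts can be plugged in the same way, with the caveat that real accelerator boards will not reach these ideal figures.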
29 November 2007, 20:22 | #85 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
Speaking of instruction-level optimizations -- if you are timing on 68030, it is important that your loops fit within the 256-byte instruction cache. If your loop is X bytes, where X > 256, then the processor will have to read 2*(X-256) bytes extra from fastmem for each iteration through the loop.
If you are very close to the 256-byte limit, you may need to 16-byte-align the code at runtime: the ultimate criterion is that the loop must fit within the 16 available cachelines, and cachelines are 16-byte-aligned on the 020/030.
[A CNOP 0,16 will not do because the AmigaOS hunk loader/memory allocator will only 8-byte align hunks during loading.]
Some optimizations (such as "move.l #imm32,dn" vs "move.l (an),dn") trade data memory access for code size; that might be the reason why you're sometimes getting counterintuitive results. |
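The fit criterion above (the loop must occupy no more than 16 cachelines of 16 bytes) and the 2*(X-256) estimate can be checked with a small helper. This is a minimal sketch assuming the 68030 layout described in the post; the function names are hypothetical.

```python
# Sketch: does a loop fit in a 256-byte instruction cache made of
# sixteen 16-byte lines, as described in the post above?

ICACHE_BYTES = 256
LINE_BYTES = 16
NUM_LINES = ICACHE_BYTES // LINE_BYTES


def cache_lines_touched(start_addr, loop_bytes):
    """Number of 16-byte-aligned lines the loop's code occupies."""
    first_line = start_addr // LINE_BYTES
    last_line = (start_addr + loop_bytes - 1) // LINE_BYTES
    return last_line - first_line + 1


def loop_fits(start_addr, loop_bytes):
    """True if the loop occupies no more lines than the cache has."""
    return cache_lines_touched(start_addr, loop_bytes) <= NUM_LINES


def extra_bytes_per_iteration(loop_bytes):
    """The post's estimate of extra fastmem reads for an oversized loop."""
    return max(0, 2 * (loop_bytes - ICACHE_BYTES))
```

For example, a 256-byte loop fits when it starts on a 16-byte boundary but not when it starts 8 bytes later, which is why alignment matters close to the limit.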
30 November 2007, 10:36 | #86 | |||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
The problem with moveq is that it is half .b and half .l : it takes a byte and converts it to a long... So accepting both .b and .l is not that stupid.
But, as I stated, it's misleading : moveq.b #n,dn is NOT equivalent to move.b #n,dn.
It multiplies the color table by 4, so I ended up with 16k. I considered it a good deal.
Quote:
Quote:
Quote:
On 020+ you can even do that :
    move.l (d0.l),d1
but it will be slower than :
    move.l d0,a0
    move.l (a0),d1
That's why I asked the question. Sometimes several instructions are faster than just one.
I'll have the answer tomorrow...
Even if it makes you use 256kb for just one or two frames ???
Quote:
To Kalms : the loop is variable, depending on the chosen pixel type, but it is 242 bytes at most in my current version. Should this fit ? |
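To make the moveq remark earlier in this post concrete: moveq sign-extends its 8-bit immediate and overwrites the whole 32-bit register, while move.b only replaces the low byte. The Python below is a small model of that difference, not assembler output; the function names are made up for illustration.

```python
# Model of the register effect of "moveq #n,Dn" versus "move.b #n,Dn".
# Registers are represented as unsigned 32-bit integers.

MASK32 = 0xFFFFFFFF


def moveq(imm8):
    """moveq #imm8,Dn: sign-extend the byte to 32 bits, replace all of Dn."""
    if not -128 <= imm8 <= 127:
        raise ValueError("moveq immediate must fit in a signed byte")
    return imm8 & MASK32


def move_b(old_dn, imm8):
    """move.b #imm8,Dn: replace only the low byte, keep the upper 24 bits."""
    return (old_dn & 0xFFFFFF00) | (imm8 & 0xFF)


if __name__ == "__main__":
    d0 = 0x12345678
    print(hex(moveq(-1)))        # whole register replaced
    print(hex(move_b(d0, -1)))   # upper 24 bits of d0 preserved
```

This is why writing the instruction as moveq.b can mislead a reader into expecting the upper bits of the register to survive.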
|||||
30 November 2007, 10:37 | #87 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,753
|
Kalms, your dma fetch reply is most enlightening
The timing program I'm going to write is going to run with no dma on only for the base bandwidth calculation. For actual screen modes that's not necessarily the case, as I want to be able to control which dma is on or off (to emulate real life 'hit the hardware' situations). I do have to admit that being able to time with just the bpl dma on would be handy, so it's going to be included (although I always use the copper to handle screens, which I know you don't have to). Indeed, loops that fit completely inside the i-cache are much faster, but I didn't know about the 16 byte alignment, so I'm definitely going to take that into account for everything I'm writing from now on. Thanks again! |
30 November 2007, 11:10 | #88 | |||||||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,753
|
Quote:
Quote:
Quote:
Quote:
Quote:
About several instructions being faster than one: It only occurred to me for multiplications, never for anything else. Since the stuff I write usually isn't suitable for a plain 68000 machine anyway, I'm going to have to do some serious brushing-up of my 680x0 knowledge soon... Quote:
to me, too. In this case, it happens just once, making it acceptable to me. Quote:
|
|||||||
30 November 2007, 11:53 | #89 | |||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
But I also want my viewer to be usable on a plain A1200 without fastmem, so I took care of the memory. My viewer usually reads the whole image into memory, but you can give it a maxmem argument to reduce the amount. Otherwise, on my 32mb a1200 I am very rarely out of memory Quote:
    add.l d0,a0
    move.l (a0),d1
is actually faster than
    move.l (a0,d0.l),d1
(probably not true on 040 though, needs to be checked)
For a single instruction that's rare, but small code isn't always faster than big code. Quote:
The one I prefer isn't a new instruction/addressing mode, but just the ability to access words/longs at unaligned addresses. +1 |
|||
30 November 2007, 12:40 | #90 | ||||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,753
|
Quote:
Getting code to work properly on a plain A1200 is cool, and for those machines it might still be a good idea to only read a small part of the image and render it, then read some more. Just a thought. Running out of non-chip memory on amigas is quite a hard thing to do with 32mb. I have 'only' 16mb, and have only used vmm once or twice! Quote:
Quote:
Quote:
|
||||
30 November 2007, 13:45 | #91 | |||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
Just my 2 cents... Quote:
You may have a look at the "input" routine in v.s to check how it's done. Quote:
And if, like I do, you make extensive use of the ram disk, then even 32mb can be quickly filled Quote:
If you make a full bmp viewer then I'm interested. This would make a neat addition to my current viewer. Quote:
On 020+ I also like the scale factor in indexed modes and the long versions of mul & div. |
|||||
30 November 2007, 18:20 | #92 | |||||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,753
|
Quote:
Quote:
Quote:
Quote:
Quote:
Anyway, I hope you like the example bmp viewer, but I'm sure you'll agree it can be optimized, and written better in places... And of course, it's all in plain 68000 |
|||||
01 December 2007, 11:12 | #93 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
meynaf: 242 bytes loop should fit, yes.
The best-case scenario is when the loop-start is at offset 0 (modulo 16); then the last byte of the loop code lies at $f1. The worst-case scenario is when the loop-start is at offset $e (modulo 16); then the last byte of loop code lies at $ff.
However, I don't know whether the 030 CPU will attempt to prefetch the instruction following a branch before the branch condition has been evaluated. (I don't think it will, but I'm not 100% sure.) If it does that, then you need to ensure that the instruction following the branch also fits into the cache.
AmigaOS's memory allocator always returns 8-byte aligned blocks, so CNOP 0,8 will successfully 8-byte align code/data in your program. By 8-byte-aligning the loopstart, you allow yourself to use loops that are up to 248 bytes long.
As for add.l d0,a0 move.l (a0),d1 vs move.l (a0,d0.l),d1 -- for 040+, the general rule is to not touch any of the registers involved in an EA computation during 2-3 cycles before the instruction that references memory. So both versions above are (roughly) equally fast; what matters more is whether you can move the updates for a0 (and d0 as well in the latter case) a few lines up in the code sequence.
Last edited by Kalms; 01 December 2007 at 11:17. |
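The offsets and the 248-byte figure in the post above follow from one small piece of arithmetic: a loop start aligned to `alignment` bytes can sit at most `16 - alignment` bytes into a cacheline, and that much cache is lost in the worst case. The sketch below assumes the 256-byte, 16-byte-line cache from the post and ignores the possible prefetch past the branch that the post leaves open; the helper names are hypothetical.

```python
# Sketch: worst-case loop size that still fits the 68030 instruction
# cache for a given guaranteed alignment of the loop start.

ICACHE_BYTES = 256
LINE_BYTES = 16


def last_byte_offset(start_offset, loop_bytes):
    """Cache offset of the loop's final byte, given its start offset."""
    return start_offset + loop_bytes - 1


def max_loop_bytes(alignment):
    """Largest loop guaranteed to fit when the start is `alignment`-aligned.

    `alignment` is assumed to be 2, 4, 8 or 16 (68k code is at least
    word-aligned). The worst start offset within a line is 16 - alignment.
    """
    if alignment not in (2, 4, 8, 16):
        raise ValueError("alignment must be 2, 4, 8 or 16")
    worst_start = LINE_BYTES - alignment
    return ICACHE_BYTES - worst_start


if __name__ == "__main__":
    for a in (2, 4, 8, 16):
        print(a, max_loop_bytes(a))
```

With plain word alignment the guaranteed limit is 242 bytes, which is exactly the size of meynaf's loop, and CNOP 0,8 raises it to 248.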
01 December 2007, 13:06 | #94 | |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,753
|
Quote:
|
|
01 December 2007, 14:08 | #95 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
Indeed.
|
01 December 2007, 14:59 | #96 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,753
|
But what about the data cache, i-burst, and d-burst ???
|
01 December 2007, 16:29 | #97 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
Ok, I'll bite.
You usually want to have I-burst on. The only case where I-burst off might be sensible is when your execution path is made up of a lot of small fragments of code (like a binary search tree implemented in code) and the fragments are noticeably smaller than a cacheline, and/or not cacheline-aligned.
D-burst on/off is only a question for 68030. On 030, if D-burst is off, the CPU will only fetch the requested longword(s); if D-burst is on, the entire cacheline will be loaded. On 040+ (with the default MMU configurations), the CPU will always fetch the entire cacheline, and D-burst controls whether it will be through four individual 32bit memory access cycles, or whether it will be through a 4-beat burst transfer. Burst transfer is never slower than 4 individual accesses.
Optimizing for 030 with D-burst on is similar to optimizing for 040+. I used to run with D-burst on at all times on 030 because if tuned, algorithms would then usually run quicker than with D-burst off.
[The bits about the 68030 are from memory, so you had better validate them against the 68030UM before taking them as gospel]
When D-burst on 68030 is off, a cache read miss causes the following effect:
* A new cacheline is allocated. If the cacheline is already allocated to another 16-byte region, it is invalidated.
* The CPU stalls until the longword has been fetched from main memory. The longword will be stored in the allocated cacheline, along with a tag in the cacheline which indicates which of the longwords is valid.
When D-burst on 68030 is on, a cache read miss causes the following effect:
* A new cacheline is allocated. The contents of the entire cacheline are invalidated.
* The CPU stalls until the first longword has been fetched from main memory. After that, the CPU continues execution.
* The bus controller fetches 4 longwords and stores them into the cache. Once finished, the entire cacheline will contain valid data.
* If the CPU tries to perform any sort of memory access before the cache line has been fully loaded, it will stall until the cacheline has been filled.
On 68030, writes will always generate memory accesses (the data cache is "Write-through"). If you read and write to the same place in memory, the writes might invalidate cache contents for matching cache lines. See 68030UM for details. In short, don't read and write to the same place.
On 68040+, a cache read miss causes the following effect:
* A new cacheline is allocated (and its contents are invalidated).
* The bus controller fetches 4 longwords and stores them into the cache. Once finished, the entire cacheline will contain valid data.
* The CPU stalls until the first longword has been fetched from main memory. After that, the CPU continues execution.
* If the CPU tries to access external memory, or the cacheline that is being loaded, before the cache line has been fully loaded, it will stall until the cacheline has been filled. The CPU can however access other cachelines while the current one is being filled.
68040+ can also cache writes. A cache write miss causes the following effect:
* A new cacheline is allocated & invalidated.
* The bus controller reads the cacheline in.
* One longword in the cacheline is replaced by the data that the CPU tried to write.
I'm not sure how long the CPU is stalled during the write-miss process.
Misaligned accesses (i.e. reads/writes that straddle longword boundaries) make the analysis a bit more complicated.
Last edited by Kalms; 01 December 2007 at 18:30. Reason: typos |
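The 68030 read-miss rules listed above can be turned into a toy model that counts misses and longword fetches for a given access pattern. This is a deliberately simplified sketch: it assumes a direct-mapped 256-byte data cache of sixteen 16-byte lines with one valid flag per longword, handles aligned longword reads only, and ignores writes, timing and stall overlap. The class and function names are invented for illustration, and, as the post itself says, the details should be checked against the 68030 user's manual.

```python
# Toy model of 68030 data-cache read misses with D-burst off and on.
# Assumptions (not from the thread): direct-mapped, 16 lines x 16 bytes,
# one valid flag per longword, aligned longword reads only.

LINE_BYTES = 16
NUM_LINES = 16


class DataCache030:
    def __init__(self, burst):
        self.burst = burst
        self.tags = [None] * NUM_LINES
        self.valid = [[False] * 4 for _ in range(NUM_LINES)]
        self.misses = 0
        self.longword_fetches = 0

    def read_long(self, addr):
        """Simulate one aligned longword read; return True on a cache hit."""
        if addr % 4:
            raise ValueError("model only handles aligned longword reads")
        index = (addr // LINE_BYTES) % NUM_LINES
        tag = addr // (LINE_BYTES * NUM_LINES)
        slot = (addr % LINE_BYTES) // 4
        if self.tags[index] == tag and self.valid[index][slot]:
            return True
        self.misses += 1
        if self.tags[index] != tag:
            # The line belonged to another 16-byte region: invalidate it.
            self.tags[index] = tag
            self.valid[index] = [False] * 4
        if self.burst:
            # Burst on: the whole line is fetched and becomes valid.
            self.valid[index] = [True] * 4
            self.longword_fetches += 4
        else:
            # Burst off: only the requested longword is fetched.
            self.valid[index][slot] = True
            self.longword_fetches += 1
        return False


def replay(burst, addresses):
    """Run a read pattern through a fresh cache; return (misses, fetches)."""
    cache = DataCache030(burst)
    for addr in addresses:
        cache.read_long(addr)
    return cache.misses, cache.longword_fetches


if __name__ == "__main__":
    sequential = range(0, 64, 4)    # 16 consecutive longwords
    strided = range(0, 256, 16)     # one longword per cacheline
    for name, pattern in (("sequential", sequential), ("strided", strided)):
        print(name, "burst off:", replay(False, pattern),
              "burst on:", replay(True, pattern))
```

In this model, sequential reads take fewer misses with burst on for the same number of longwords fetched, while a one-longword-per-line pattern fetches four times as much data with burst on. That is the trade-off behind both preferences expressed in the thread.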
01 December 2007, 17:52 | #98 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,753
|
Wow, Kalms, thanks
This stuff is very interesting, and I'm going to have a very thorough look at it. Taking all this into account when programming seems a little hard, but maybe not when you're used to it. Since I'm so used to plain 68000, this would be a great addition to a bunch of new and improved coding habits! And I was kinda hoping you'd bite. Thanks again |
03 December 2007, 10:41 | #99 | |||||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
I see things have advanced this week-end...
Quote:
Quote:
Quote:
Quote:
(sorry, I just couldn't resist) As for Intuition coding, everything is done within my library; I'll never call OpenScreenTagList/OpenWindowTagList directly again... You can look at it, or, simpler, just use it. Opening an Intuition screen is bloody easy. Quote:
Are cas/cas2/chk2 really useful ? Quote:
Quote:
To Kalms : feel free to bite whenever you want. You got it right with the burst stuff. When it's inactive, we're fetching 4 bytes at a time. When it's active, we're fetching 16 bytes at a time. Slower than one longword access, but faster than four longword accesses. So it's good for consecutive memory accesses, e.g. for code or for a copymem. But I personally leave it disabled for data, as it's better for everyday use. |
|||||||
03 December 2007, 11:38 | #100 | |||||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,753
|
Quote:
Quote:
Quote:
Quote:
Quote:
Sorry about the archive I've uploaded being rather big. The three bmps I've included are rather large (in particular the 1280x1024 one: almost 4mb). And I just couldn't resist adding some extras: IrfanView's ability to display iff files properly is just all too handy, and it contains a very flexible batch conversion feature, so I included it. |
|||||