RiVa AMMX Benchmarks
Since the start of the Apollo 68080 adventure, Stephen Fellner, author of the RiVa MPEG Player, has been kind enough to share the sources of his player with the Apollo team.
As some of you may have noticed from the core changes over the last months, AMMX instructions were introduced in the Apollo core, meaning that software that takes advantage of them sees big speedups. With the dedicated work of buggs and flype, RiVa has been modified to take advantage of those AMMX instructions.
Here are some results: http://i.imgur.com/3sqaBkk.png
Core: Apollo 68080 AMMX Core, Revision 3543, x11 speed
RiVa parameters: VERBOSE NOAUDIO DISPLAY=HICOLOR NOSKIP FPS=1000
Download links: Original RiVa, TopGun 320 Video, TopGun 640 Video
Image quality has also been greatly improved (YUYV versus R5G6B5 quality) and stereo audio has also been enabled!
All this is still WIP and should be part of the next GOLD2 release ;) Stay tuned!
http://fpga.amiga-ng.org/resources/Vampire/riva.jpg |
This is very awesome! :D
|
This is just empty talk without showing the actual code (both before and after).
I can get +200% and more by just rewriting compiled code into ASM, no need for AMMX. |
I think it would be great if you joined in and helped make RiVa faster for regular Amigas as well.
|
Quote:
Thanks for your encouraging message! We would be very happy to have your skills on the team to continue the improvements. Would you like to join? P.S.: RiVa is already 100% ASM. |
Here is a screenshot of the improvement on a 68060 (A4000 with CS060MK-I and Cybervision 64):
http://bax.comlab.uni-rostock.de/fil...VA052vs050.jpg
Original 0.50 on top, new version at the bottom. That's a 17.85% improvement. Looking forward to 200%. |
Quote:
I agree, it seems a little large for that. But it is. I can confirm that it is 100% ASM, not a single line of C, even for reading the tooltypes and command line arguments :-) It uses the MPEGA.library for decoding the audio (which is also written in ASM). The full chain is ASM afaik.
The original source code, kindly shared by S. Fellner himself, is about 400KB of pure ASM (and some tables), and the main file is about 15 000 lines, which is roughly equivalent to MPEGA itself. The legacy code is already impressive and optimized; it has been preserved for the Apollo project, and some improvements have been added to it. The work done by Buggs is very serious, respects the legacy code and brings new AMMX-dedicated code, but not only that (the 68060 also benefits from the recent work).
Maintaining RiVa is a serious project which needs serious 68k skills. We are talking about the whole range of 68k features: superscalar execution, instruction/data caches, cache hits, branch prediction, decoding accuracy, rendering accuracy, testing accuracy, ...
Another fact: the RiVa project compiles on my V600 in less than 8 seconds using the latest VASM_mot for 68000... and 68080 support. That is a very acceptable compile time compared with a gcc project, by far. Of course, 100% ASM is hard to maintain and takes some weeks to understand. |
Yeah maybe, but mpega.library isn't 100% asm to start with (only critical parts were initially asm ; i got significant speedup by rewriting the huffman decoding). It's around 250-300k source (from resourced), without tables. It was compiled with SAS/C in 020+ mode (yeah i can see that when disassembling, heheh).
If RiVa isn't much bigger than mpega then it's perfectly manageable. Now i'm wondering if old MPEG-1 is still worth the trouble, with all these MPEG-4 videos all around... I don't believe in mmx and related stuff, and will not until i see normal handwritten asm beaten by a large amount, counting individual clocks in the routine(s). This is why :scream i want to see the code :scream (i have the slight impression that i will have to repeat my last sentence a few times...) |
Well Meynaf, you'd like to see some code? Here you go. Core loop in horizontal interpolation as an example. Hope the post ain't too long.
Original (core loop over two pixels, without proper rounding):
Code:
.y_xloop move.b (a1)+,d2 ;d2: --- --- --- 1
Code:
move.l (a1),d1 ; P00 P01 P02 P03
Code:
LOADAB 1,0 ; LOAD (A1),B0 |
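For readers who don't read 68k asm, here is a minimal scalar C sketch of what a horizontal interpolation loop computes: each output pixel is the average of a source pixel and its right neighbour. This is not RiVa's code. The function names and the assumption that the source row holds n + 1 readable pixels are made up for illustration. The first variant truncates (the "without proper rounding" case mentioned above), the second adds 1 before the shift so that halves round up.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: truncating average of each pixel with its right
 * neighbour. src must hold at least n + 1 pixels, dst at least n. */
void hinterp_trunc(const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)((src[i] + src[i + 1]) >> 1);
}

/* Hypothetical helper: same average, but rounding halves up by adding 1
 * before the shift. */
void hinterp_round(const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)((src[i] + src[i + 1] + 1) >> 1);
}
```

The two variants differ only when the sum of the two pixels is odd, in which case the truncating version comes out one step darker.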
Quote:
This is horizontal interpolation (from a JPEG decoder) - try to rewrite it with SIMD if you wish:
Code:
; a0 = input, a1 = output
Upsampling isn't very important in the final timing. The most important code is the DCT. As it's supposed to be done with this data parallelism stuff, well, that's what i want to see. Good luck without SIMD multiply.
Btw 2. The parallel instructions you use here are not documented anywhere. They just come out of nowhere and i'm supposed to trust this...
Btw 3. You have to understand that this SIMD stuff will only work for very simple tasks. As soon as it becomes relatively complex, it starts to fail miserably - or you'll have to create new instructions for almost everything you do. I'm not against doing things in parallel, i'm against creating new big fat registers for the sake of speed.
Btw 4. This example can be rewritten by creating a simple longword parallel byte average instruction. One instruction added instead of a whole block, same timing if executed on two pipes. |
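To make the "longword parallel byte average" from Btw 4 concrete: what such an instruction would compute can be emulated in plain C with a well-known bit trick that averages four packed bytes at once inside a 32-bit word. This is only a sketch of the arithmetic, not a proposed or existing 68k or AMMX instruction, and the function names are hypothetical. It relies on a + b = 2*(a & b) + (a ^ b), with the low bit of every byte masked off before the shift so that nothing leaks into the neighbouring byte.

```c
#include <stdint.h>

/* Hypothetical helper: per-byte average of two packed longwords,
 * rounding down. Each of the four byte lanes is handled independently. */
uint32_t avg4_floor(uint32_t a, uint32_t b)
{
    return (a & b) + (((a ^ b) & 0xFEFEFEFEu) >> 1);
}

/* Hypothetical helper: per-byte average of two packed longwords,
 * rounding up. Uses a + b = 2*(a | b) - (a ^ b). */
uint32_t avg4_ceil(uint32_t a, uint32_t b)
{
    return (a | b) - (((a ^ b) & 0xFEFEFEFEu) >> 1);
}
```

For example, averaging 0x00FF10FF with 0x02FF2000 gives 0x01FF187F rounding down and 0x01FF1880 rounding up; only the last byte lane (0xFF and 0x00) differs between the two.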
Quote:
So what are your reasons? If a few ammx instructions can speed up movie replaying by a factor of two, they must be very convincing reasons. |
Quote:
But you can't compare. Intel throws a massive number of logic gates at every problem they face. There is also a good reason why their CPUs don't go in handheld devices.
Quote:
What is movie replaying anyway? Old MPEG-1? You can't show YouTube videos with that. Play them with your smartphone: they won't use any MMX-like extensions - they'll use a GPU.
So perhaps a few AMMX extensions can give better performance in a single example. One can always invent new instructions for a particular case. But for the next program you write, new AMMX extensions will be needed. All that for a miserable x2 speed on something that's not used anymore.
I have disassembled RiVa and seen that this is indeed asm code, but the guy apparently "played compiler"; it seems the original code has been followed without much refactoring. For example, the DCT shifts the result back too early after the multiplies, leading to a loss of speed (unneeded shifts) and a loss of quality (reduced accuracy). There is also a lot of duplicated code (if not dead code). In short it's not exactly nice - not surprising that a few code rewrites got a speedup. Again, MMX isn't needed for that. |
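The point about shifting back too early can be illustrated with a tiny fixed-point sketch in C. This is not RiVa's DCT. The function names, the Q8 coefficient format (value times 256) and the two-term sum are assumptions chosen only to show the effect, and the inputs are assumed non-negative to keep the shifts simple. Shifting each product separately costs one shift per multiply and throws away the fractional bits before they can add up; keeping the products at full precision and shifting the sum once needs a single shift and keeps more accuracy.

```c
#include <stdint.h>

/* Hypothetical example: two-term fixed-point sum, coefficients in Q8.
 * Each product is shifted back right after the multiply. */
int32_t sum_shift_early(int32_t x0, int32_t c0, int32_t x1, int32_t c1)
{
    return ((x0 * c0) >> 8) + ((x1 * c1) >> 8);
}

/* Same sum, but the products are accumulated at full precision and
 * shifted back only once at the end. */
int32_t sum_shift_late(int32_t x0, int32_t c0, int32_t x1, int32_t c1)
{
    return (x0 * c0 + x1 * c1) >> 8;
}
```

With both coefficients set to 128 (0.5 in Q8) and both inputs set to 1, the early-shift version returns 0 while the late-shift version returns 1, which is the exact result of 0.5 + 0.5.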
Since you know how to do it so much better: when will we see your non-AMMX version of the code beating the AMMX version?
|
Quote:
Anyways the "ammx version" doesn't need my code to be beaten:
Code:
4.Upload:> riva-0.50 verbose fps=1000 noskip noaudio dither=gray shk_topgun_320.mpg |
Quote:
BTW, the fact that you had to use the dither=gray option to make your UAE beat the 080 proves two points: AMMX is a very powerful extension to the 68k ISA and you need to buy a newer PC. |