Old 04 April 2018, 22:10   #201
mark_k
Registered User
 
Join Date: Aug 2004
Location:
Posts: 3,333
Something I noticed when experimenting with full-window mode... I wanted to reduce Windows desktop resolution to (hopefully) allow more slices to be used. Monitor native res is 3840×2160 so I wanted to set desktop to 1920×1080.

This is on Windows 10 1709 with an AMD graphics card. Windows is very keen on scaling its desktop to the monitor's native res. If you go into Windows display settings and change the resolution shown in the drop-down box from 3840 × 2160 to 1920 × 1080, the actual video signal output is still 3840×2160. That's even though I had disabled the GPU scaling option in Radeon settings.

In order to get a real 1920×1080 display as confirmed by monitor OSD info, I had to click Display adapter properties, then List All Modes, then choose 1920×1080 from the list there.

Anyway... Without setting a real 1920×1080 display mode (just changing res in Windows display settings), WinUAE's timing is messed up. It seems to detect the actual video signal being output and bases timings on that, whereas each of the 1080 scanlines is actually being output twice (more or less).

I'm not sure how good results would be with correct timings in that case, but here's some log output:
Code:
D3D11 init start. (1920*1080) (1920*1080) RTG=0 Depth=32.
CheckFeatureSupport(DXGI_FEATURE_PRESENT_ALLOW_TEARING) = 00000000 1
D3D11 found matching refresh rate 60/1=60.00. SLO=1
D3D11 Device: AMD FirePro W5100 [\\.\DISPLAY1] (0,0,1920,1080)
D3D11CreateDevice succeeded with level 11.1. Hardware accelerated.
D3D11 2 00000800 00000057
D3D11_resize 0 0 0 (0)
D3D11 init end
D3D11 resize do
D3D11_resize -> none
D3D11 resizemode start
D3D11 freed3d start
D3D11 freed3d end
D3D11 resizemode 1920x1080, 1920x1080 2 00000800 FS=-1
D3D11 initd3d start
-> -0.000000 0.000000 0.000000 -0.000000 0.000000 0.000000
D3D11 initd3d end
D3D11 1920x1080 main texture allocated
POS (0 0 1920 1080) - (0 0 480 270)[480,270] (720 405) S=1920*1080 B=1920*1080
0 0 4 4
-> -3840.000000 2160.000000 3840.000000 -2160.000000 4.000000 4.000000
D3D11 resizemode end
D3D11 resize exit
D3D11 376x288 main texture allocated
POS (0 0 1920 1080) - (-772 -396 -292 -126)[480,270] (720 405) S=1920*1080 B=376*288
0 0 4 4
-> -752.000000 576.000000 752.000000 -576.000000 4.000000 4.000000
Buffer size (376*288) Native
ActiveHeight: 2160 TotalHeight: 2222 VFreq=60/1=60.00Hz HFreq=533250000/4000=133.313KHz
Spincount = 17331
...
ActiveHeight: 2160 TotalHeight: 2222 VFreq=60/1=60.00Hz HFreq=533250000/4000=133.313KHz
PAL mode V=60.0000Hz H=15625.0879Hz (227x312+0) IDX=10 (PAL) D=0 RTG=0/0
D3D11 376x287 main texture allocated
POS (0 0 1920 1080) - (-772 -396 -292 -126)[480,270] (720 405) S=1920*1080 B=376*287
0 -1 4 4
-> -752.000000 580.000000 752.000000 -568.000000 4.000000 4.000000
Buffer size (376*287) Native
RTGFREQ: 312*60.0000 = 18720.0000 / 60.0 = 312
D3D11 Shader and extra textures restored
Old 05 April 2018, 02:03   #202
mdrejhon
Chief Blur Buster
 
Join Date: Mar 2013
Location: Toronto, Canada
Posts: 40
Quote:
Originally Posted by mark_k
DXGI 1.2 supports partial presentation, see Enhancing presentation with the flip model, dirty rectangles, and scrolled areas on MSDN. With that you'd use DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL or DXGI_SWAP_EFFECT_SEQUENTIAL and specify the dirty rectangle to be the just-rendered strip. (DXGI_SWAP_EFFECT_DISCARD might work as well, if Windows allows it. Since we don't care what's in the rest of the display, just the current strip.)

Also, could partial texture updates help? Use ID3D11DeviceContext::UpdateSubresource() to partially update the texture? Or maybe use multiple textures, one for each strip. Before each Present() you'd only update one texture.

Is there any way to visualise how much GPU bandwidth is being used, in real time? (Preferably something that can co-exist on-screen with WinUAE.)
Interesting tip. I'll have to research whether this can increase frameslice throughput enough to achieve single-scanline-height frameslices.
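
For reference, here is a minimal sketch of the dirty-rectangle presentation idea quoted above, assuming a DXGI 1.2 flip-model swap chain (swapChain1, stripTop, stripBottom and screenWidth are illustrative placeholders):
Code:
// Present only the just-rendered strip as a dirty rectangle (DXGI 1.2+).
// swapChain1 is an IDXGISwapChain1 created with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL.
RECT dirty = { 0, stripTop, screenWidth, stripBottom };

DXGI_PRESENT_PARAMETERS params = {};
params.DirtyRectsCount = 1;    // only this region changed since the last Present
params.pDirtyRects = &dirty;   // the OS/driver may copy just this strip

// SyncInterval 0 = VSYNC OFF, needed for beam-raced tearline control.
HRESULT hr = swapChain1->Present1(0, 0, &params);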

At 7000 frameslices per second at 2560x1440 on my computer, that's up to 77 gigabytes per second of memory throughput (24-bit framebuffers being blitted repeatedly over and over). Memory bandwidth is the bottleneck for VSYNC OFF based beam racing. But that can apparently be avoided (and frameslice throughput increased too!).
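
(For reference, the arithmetic behind that figure:)
Code:
2560 x 1440 pixels x 3 bytes    = ~11 MB per full-frame blit
~11 MB x 7000 blits per second  = ~77 GB/s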

I've had success avoiding that memory bandwidth (to an extent) simply by not pre-clearing the framebuffer before rendering into it. VSYNC OFF still internally uses 2 different frame buffers, alternating between the two when flipped.

According to my tests, when I leave the buffers uncleared and keep flipping the two, the pre-existing buffer contents alternate between the tearlines -- the front buffer becomes my back buffer, and the back buffer becomes the front buffer.

They retain their original content, so this is a bandwidth-saver piggyback opportunity. I can see it because if I flip back and forth without doing anything, it simply flashes between the most recent 2 frames rendered. This is a beneficial behaviour that can be piggybacked upon for more memory-bandwidth-optimized VSYNC OFF frameslice beamracing.

To make beam racing compatible with that, one could skip preclearing the buffers and blit only the most recent two emulator frameslices. That'll cover both of the framebuffers. But there's no guarantee that all VSYNC OFF implementations use this "back<->front" buffer-trading algorithm. Probably best done as a configuration option for a "memory-bandwidth-reducing" setting. It would help make beam racing more compatible with laptop GPUs running off non-VRAM.
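
A rough sketch of that idea, assuming the driver really does trade two persistent buffers on flip as observed above (blitSlice() and swapChain are hypothetical placeholders):
Code:
// Patch both alternating buffers instead of clearing them (sketch only).
void presentSlice(int sliceIdx)
{
    // The buffer we are about to draw into last showed frame N-2's content,
    // so it is missing the previous slice as well as the current one.
    blitSlice(sliceIdx - 1);  // re-blit the previous slice into the stale buffer
    blitSlice(sliceIdx);      // blit the freshly emulated slice
    swapChain->Present(0, 0); // VSYNC OFF flip; the two buffers trade places
}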

Shaders/fuzzylines/HLSL type stuff will have to be modified to be bleed-aware to prevent introducing artifacts at frameslice boundaries, but could still piggyback off this memory-bandwidth-saver trick.

Ideally we still need to reuse the whole emulator frame buffer if we want a full refresh cycle of jitter safety margin (i.e. variable distance between emulator raster & real raster).

Last edited by mdrejhon; 05 April 2018 at 04:18.
Old 05 April 2018, 13:04   #203
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
Quote:
Originally Posted by mark_k
ActiveHeight: 2160 TotalHeight: 2222 VFreq=60/1=60.00Hz HFreq=533250000/4000=133.313KHz
This means the mode is still scaled (2160 pixel height). Or someone is lying.

Quote:
Could there be any scope for reducing GPU memory bandwidth as this feature is refined? That could allow lower-end systems to benefit from this feature, and higher-end ones to use more slices.
Perhaps, but I am not going to do anything complex, because the easiest solution is the usual one: buy a better PC. A single slice is still much better than normal mode.

Quote:
DXGI 1.2 supports partial presentation, see
Isn't that more to do with non-flip mode where copying is needed? I don't see how partial presentation can help with flipping.

Windowed mode (where flipping can't work) needs a variable sync monitor anyway, because no one uses a 50Hz desktop = can't be too old a PC. (100Hz also means a not-too-old PC.)

Partial texture copy can be done but it also won't help much because Amiga internal resolution is relatively small (768*568 or so in normal configuration).

I'll test if not clearing helps.
Old 05 April 2018, 14:05   #204
mark_k
Registered User
 
Join Date: Aug 2004
Location:
Posts: 3,333
Quote:
Originally Posted by Toni Wilen
This means mode is still scaled (2160 pixel height). Or someone is lying.
Right, that log was from the case where Windows desktop is 1920×1080 but Windows (or GPU driver) scales it to 3840×2160 for output. You can detect cases like that since the full-window pixel height doesn't match the detected display height. Typically the scale factor would be a small integer or "nice" fraction (e.g. 1080⇒1440 is 4/3).
Quote:
Isn't that more to do with non-flip mode where copying is needed? I don't see how partial presentation can help with flipping.
The MSDN page puts more emphasis on flip being the recommended way. That is, DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL being preferable to DXGI_SWAP_EFFECT_SEQUENTIAL. In connection with DWM it says "By using flip model, back buffers are flipped between the runtime and Desktop Window Manager (DWM), so DWM always composes directly from the back buffer instead of copying the back buffer content."

I assume that when composing each frame, the graphics driver can take the previous frame's image (in GPU memory) and just copy the small changed part to it, rather than copying the entire image from PC system memory?

Quote:
Windowed mode (where flipping can't work) needs a variable sync monitor anyway, because no one uses a 50Hz desktop = can't be too old a PC. (100Hz also means a not-too-old PC.)
It is possible in most cases to set the desktop refresh rate to 50Hz. [But some monitors duplicate or drop frames so as to always refresh the panel at 60Hz... so it's kind of pointless there.]

Did you mention before being able to get tearing in windowed mode with an Nvidia graphics card? (On my AMD card I can get tearing in full-window and full-screen modes but not windowed.)
Old 05 April 2018, 18:39   #205
mdrejhon
Chief Blur Buster
 
Join Date: Mar 2013
Location: Toronto, Canada
Posts: 40
Quote:
Originally Posted by Toni Wilen
Partial texture copy can be done but it also won't help much because Amiga internal resolution is relatively small (768*568 or so in normal configuration).
But what about the final buffer after full shader effects are applied -- that makes it a much higher resolution frame buffer. Those are rather big framebuffers, aren't they? (The shader would obviously need to be slice-compatible, though)

(I will have to do benchmark tests to see whether it improves things much or not... lemme see)
Old 05 April 2018, 20:23   #206
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
CopySubresourceRegion() is used now. I didn't see anything different without ClearRenderTargetView() (clearing is probably very optimized in GPUs because it is a common operation). It is not disabled because that would also remove the debug colors.
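
(For illustration, a strip-sized copy with that call might look like this; the variable names are hypothetical:)
Code:
// Copy a freshly emulated strip-sized source texture into the main texture.
// stripTop is the destination Y offset; width/stripHeight describe the strip.
D3D11_BOX box = { 0, 0, 0, width, stripHeight, 1 }; // left, top, front, right, bottom, back
context->CopySubresourceRegion(mainTex, 0,         // dest resource, subresource
                               0, stripTop, 0,     // dest X, Y, Z
                               stripTex, 0, &box); // source resource, subresource, region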

Quote:
Originally Posted by mdrejhon
But what about the final buffer after full shader effects are applied -- that makes it a much higher resolution frame buffer. Those are rather big framebuffers, aren't they? (The shader would obviously need to be slice-compatible, though)
That's the big one. Many shaders are "non-linear" vertically; not sure if it can be done without adding a large safety margin.
Old 05 April 2018, 20:54   #207
mdrejhon
Chief Blur Buster
 
Join Date: Mar 2013
Location: Toronto, Canada
Posts: 40
Quote:
Originally Posted by Toni Wilen
That's the big one. Many shaders are "non-linear" vertically; not sure if it can be done without adding a large safety margin.
Agreed, the shader would have to be frameslice-aware. Not easy programming.

That said, due to scanout, each frameslice is a lag gradient itself:
-- 1 frameslice: [0...1.00/60sec lag] = 16.7ms difference from top edge to bottom edge of each frameslice
-- 2 frameslices: [0...0.50/60sec lag] = 8.33ms difference from top edge to bottom edge of each frameslice
-- 3 frameslices: [0...0.33/60sec lag] = 5.55ms difference from top edge to bottom edge of each frameslice
-- 4 frameslices: [0...0.25/60sec lag] = 4.17ms difference from top edge to bottom edge of each frameslice
-- 10 frameslices: [0...0.10/60sec lag] = 1.67ms difference from top edge to bottom edge of each frameslice
(Lag granularity / lag gradients become finer, the more frameslices...)

Thusly:
10 frameslices + 108-line offset (2-frameslice lag instead of 1-frameslice lag) = (1.67ms lag + 1.67ms lag) = 3.33ms lag in the emuraster-vs-realraster chase margin.

So a 108 line offset (between emu-raster and real-raster) isn't too bad, to compensate for things like curved-simulated CRTs and ambient bleed/fuzz effects. Like halos around bright parts. Long term, I think the chase distance between emuraster + realraster should probably be a slider adjustment while watching a horizontal-panning motion test, to observe for glitches/tearing artifacts.

Nonetheless, it can be a low-priority "nice-to-have". Most modern NVIDIA/AMD GPUs from the last 5 years are powerful enough to just do a full shader re-render 4 times a refresh cycle. Even 4 frameslices is still one-quarter refresh cycle latency (a 75% lag reduction). Sub-frame rendering via real-time beam racing + shaders enabled = still beautifully viable with full shader re-renders. Pinball has never been this much fun in an emulator before today!

Last edited by mdrejhon; 06 April 2018 at 19:50.
Old 07 April 2018, 22:18   #208
mark_k
Registered User
 
Join Date: Aug 2004
Location:
Posts: 3,333
Perhaps another thing to try: instead of a full-size Amiga texture (752×576 or whatever) partially updated, have a smaller texture corresponding to one strip. That would be mirrored over the output surface (i.e. the whole screen) with D3D11_TEXTURE_ADDRESS_MIRROR. Hopefully the mirroring would mean that shaders which sample from rows above/below each strip don't look too bad.
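
A minimal sketch of such a sampler (D3D11; the mirror addressing on V is the point here):
Code:
// Sampler that mirrors the strip texture vertically.
D3D11_SAMPLER_DESC sd = {};
sd.Filter = D3D11_FILTER_MIN_MAG_MIP_LINEAR;
sd.AddressU = D3D11_TEXTURE_ADDRESS_CLAMP;
sd.AddressV = D3D11_TEXTURE_ADDRESS_MIRROR; // rows above/below reflect the strip
sd.AddressW = D3D11_TEXTURE_ADDRESS_CLAMP;
sd.ComparisonFunc = D3D11_COMPARISON_NEVER;
sd.MaxLOD = D3D11_FLOAT32_MAX;

ID3D11SamplerState* mirrorSampler = nullptr;
device->CreateSamplerState(&sd, &mirrorSampler);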

Maybe multi-threaded Present() could also help? [Assuming you don't do that already.] So have two thin strip-size textures. Emulate/render to one, tell the other thread to Present(), then emulate/render into the other texture, tell the other thread to Present(), etc.

For shaders, could modifying them to do something like this help reduce memory bandwidth?
Code:
// Early-out for pixels outside the current strip (HLSL-style sketch).
if (current_line < strip_top || current_line > strip_bottom)
    return float4(0, 0, 0, 1); // black
else
    return normal_result;      // whatever the shader does normally
Whether there's any performance difference, and how much, would vary depending on OS and graphics driver/hardware I guess.
Old 08 April 2018, 09:58   #209
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
Having only Present() in a separate thread won't work unless you guarantee no D3D calls are made by other threads when calling Present(). It also does not gain anything, because normally Present() returns immediately. If it does not, it means the previous rendering was not ready and the present gets delayed, possibly causing glitches (= too slow CPU or GPU). Threading won't fix that.

The default shader is very simple; it just does the minimum needed to set the pixel color (like a pre-shader hardwired rasterizer would do). Comparisons would only make it slower.

I am asking again: is this really worth the trouble? No one has mentioned any major slowdowns with ~3-4 slices. Let's first see what happens when the first official beta is released.
Old 08 April 2018, 20:00   #210
Retro-Nerd
Missile Command Champion
 
Join Date: Aug 2005
Location: Germany
Age: 52
Posts: 12,435
Quote:
- Input is read after each slice.
- Default is 4 slices which makes max input latency about 6ms. (Assuming quality "gaming grade" USB input devices that use <=1ms USB rate.)
What about the input lag of the monitors used? That adds a bit to the total lag. So even with the best Acer Predator it's approx. 9ms then, right? Or did I miss something?

Last edited by Retro-Nerd; 08 April 2018 at 21:36.
Old 09 April 2018, 05:22   #211
mdrejhon
Chief Blur Buster
 
Join Date: Mar 2013
Location: Toronto, Canada
Posts: 40
Quote:
Originally Posted by Retro-Nerd
What about the input lag of the monitors used? That adds a bit to the total lag. So even with the best Acer Predator it's approx. 9ms then, right? Or did I miss something?
Yes, monitor lag is a factor, but most monitor lag tests use a "VSYNC ON" lag tester, so the numbers need to be interpreted differently.

9ms on DisplayLag.com = 2ms using beam racing

Click here for technical explanation

Most lag test websites measure using a "Leo Bodnar Lag Tester" device, which is a 60 Hz VSYNC ON lag tester whose lag stopwatch runs from VBI to screen-middle. Half of 1/60sec is (16.7ms)/2 = 8.3ms... so even a 60 Hz CRT shows 8.3ms lag on the CENTER square with a Leo Bodnar Lag Tester, simply because it is a 60 Hz VSYNC ON lag tester.

The lag you get with VSYNC OFF or beam racing (tearingless VSYNC OFF) is up to half a refresh cycle less than the CENTER square of a Leo Bodnar Lag Tester. With beam racing, all squares equalize in lag: the screen-CENTER and BOTTOM-edge tests have the same lag as the TOP edge. TOP (~2-3ms) == CENTER (~2-3ms) == BOTTOM (~2-3ms) on the current best low-lag LCD gaming monitors that do real-time synchronous scanout (cable scanout = panel scanout). That remaining lag is GtG overhead: several modern gaming LCD monitors have no monitor-side framebuffer delay (they do line-buffered processing) and actually scan the video signal in real time nearly straight onto the LCD panel. I've confirmed this via high speed camera.

TL;DR: That particular Acer Predator is only ~1-2ms slower than the CRT via the Leo Bodnar lag test stopwatching criteria.

I'm the inventor of some lag testing techniques too, as I am the world's first person to successfully test the input lag of GSYNC (I wrote that in 2013!).
And I've met the people behind other lag testing websites (including DisplayLag.com). So by now, you know I'm a known expert in display lag behaviours!


Last edited by mdrejhon; 09 April 2018 at 05:46.
Old 09 April 2018, 11:25   #212
Retro-Nerd
Missile Command Champion
 
Join Date: Aug 2005
Location: Germany
Age: 52
Posts: 12,435
Thanks for the explanation. This makes the new method even more impressive.
Old 10 April 2018, 08:44   #213
ReadOnlyCat
Code Kitten
 
Join Date: Aug 2015
Location: Montreal/Canadia
Age: 52
Posts: 1,178
Quote:
Originally Posted by mdrejhon
Fellow software developer....

There's no buffer waiting in this new version even for mid-screen input reads.
It's sub-frame latency even for mid-screen input reads!!
Toni just successfully implemented tearingless VSYNC OFF

[snip, an unbearably long wall of text]
Fellow kitten, thank you for the detailed reply.

First of all, I want to make clear that I understand what the method is about, as well as all the subtleties of beam racing, page flips / present, vsync, how your method works and how Toni implemented it: producing pixels synchronously with the display refresh has been the dream of every video game coder since before the Amiga existed, so this is not a new concept for this kitten.

Second, you may not have noticed, but you have a tendency to repeat yourself several times in your posts, which makes them very difficult to read. I really recommend that you take the time to eliminate redundant information and jargon, and to simplify.
I cannot blame you, I was like you before, but believe me: the less you say, the more people understand and listen.

Now onto my main point:

Quote:
It's like bufferless VSYNC ON.
Tearlines are simply just rasters. We're simply raster-timing the tearlines out of the way
Toni just successfully implemented raster-synchronized VSYNC OFF.
Tearing never appears
I was talking about hypothetical tearlines which *would* appear *if* one started to display the next frame mid-raster, which is not what your method is about. I was not actually implying that your method produces tearlines.

But this is actually secondary, so I will not delve into this point.

Quote:
To genuinely get less lag than the original machine, you simply "cheat" by doing surge-execute cycles followed by surge-scanout cycles. For the average joe user, nothing amiss is noticed if the frames output at their regular rate -- but the emulator can cheat via accelerated beam-racing an accelerated-scanout display. Basically running Amiga CPU 4x faster while real-time streaming pixels out to a 4x-faster-scanout display ("60Hz" refresh cycles scanned out in 1/240sec ... with a pause of 3/240sec before doing the next surge-execute). It's now possible with the latest WinUAE beta, I just helped Toni make beam racing VRR compatible...
There are simpler ways to say the same thing. I had to parse this sentence several times to understand what you actually meant just because of the surge-scanout-stuff keywords. They sound nice but they hide the simplicity of the method.

To rephrase it (correct me if I am wrong):

Since the display refresh rate is 4x that of the emulated machine's and the emulator can also emulate 4x faster, we emulate the current frame in 1/240s, and we immediately present the resulting frame buffer to the 240Hz display, then we keep that frame constant for 3/240s.

Essentially what this does is compress the Amiga frame execution time to 1/4th of its duration and then idle until the next frame needs to be executed. All while keeping the rate of visual frames output at a steady, non-jittering 60Hz.

Quote:
That shortens the time between input read and pixels on the screen, no matter where the input read is -- as long as the input read is somewhere further up from the bottom of the screen. If the input read is during the VBI, there is up to 1/60sec less lag than the original machine in the "infinite-fast-surge-execute" situation. If the input read is during the center-of-screen, there is 0.5/60sec less lag than the original machine.
That is not correct.

This shortens the time between input changes and pixel changes for all inputs occurring in the top 1/4th of the Amiga frame.
For all inputs which occur after that first 1/4th of a frame, it introduces an additional latency of one full 60Hz frame, since they will be taken into account at the next 60Hz frame.

This does not reduce latency at all, but gives it a bias toward the beginning of the (Amiga) frame, which is very different. Essentially this moves the input scanning point to the top of the Amiga frame rather than, for example, mid-screen.

In order to claim that this is a benefit, you must assume that most inputs occur in the top 1/4th of the screen of an Amiga frame, which is frankly dubious given that inputs are equally likely to occur at any moment during a frame.
Humans are not precise enough to time an action within the top quarter of a 60Hz frame, especially in the digital domain (*).

And this is especially true given that CRT images are painted progressively, so the most important elements of an image are not even visible at the top of the screen: even if visual-to-hand reflexes had sub-frame granularity, they would most likely occur at mid-screen anyway, when the information is available.

Quote:
Now, we can combine VRR and beam racing for realistic "less-than-original" lag.
If that system works, it is by definition not realistic in any way since the original machine did not behave like that.

Quote:
P.S. I'm located in Canada (Hamilton, specifically). So you code in Montreal?
My employer indeed pays me to do so.


That is all for today, thanks for bearing with me. (ˆˆ)/

(*) in the analog domain, there is some evidence that they can at least detect (as opposed to respond to) very fine lag (millisecond) but this is likely an indirect detection made possible by the easier predictability of the response to analog movements.
Old 10 April 2018, 18:46   #214
mdrejhon
Chief Blur Buster
 
Join Date: Mar 2013
Location: Toronto, Canada
Posts: 40
Quote:
Originally Posted by ReadOnlyCat
you may not have noticed but you have a tendency to repeat yourself several times in your posts which makes them very difficult to read. I really recommend that you take your time eliminating redundant information and jargon and simplifying.
I'm actually aware that I am unnecessarily long sometimes. Noted.
Sometimes my long writing style works better in other contexts (e.g. my 480Hz tests and the animated 1000Hz Journey articles) but in these forum contexts I need to remember to adjust, especially since forum posts are not proofread/diagrammed/animated like my articles. It's my style, but I'll work to reply shorter to your quotes now.

Quote:
Originally Posted by ReadOnlyCat
Since the display refresh rate is 4x that of the emulated machine's and the emulator can also emulate 4x faster, we emulate the current frame in 1/240s, and we immediately present the resulting frame buffer to the 240Hz display, then we keep that frame constant for 3/240s. Essentially what this does is compress the Amiga frame execution time to 1/4th of its duration and then idle until the next frame needs to be executed. All while keeping the rate of visual frames output at a steady, non-jittering 60Hz.
Yes, yes. You are correct here.

Quote:
Originally Posted by ReadOnlyCat
That is not correct.
I spent 20 minutes carefully analyzing your words, redoing math formulas on paper, and:
- We might be stopwatching lag differently, i.e. different lag stopwatch start/stop points.
- For my lag stopwatch, there's no such thing as "1/4" except in a 4-frameslice situation.
- For my lag stopwatch, it is not a function of Emulator:Realworld scanout velocity difference.

As long as the input read is done anywhere later in the refresh cycle, and the emulator scanout velocity is faster, there's a lag reduction versus the original machine under the specific lag stopwatch criteria:

Quote:
Originally Posted by ReadOnlyCat
In order to claim that this is a benefit, you must assume that most inputs occur in the top 1/4th of the screen of an Amiga frame, which is frankly dubious given that inputs are equally likely to occur at any moment during a frame.
Under my lag stopwatch criteria for this case, there's no such thing as "1/4", nor "first 1/4", nor "after first 1/4". The biggest difference occurs near the bottom edge of the screen: it is proportional to the scan line number, regardless of scanout velocity. So the bigger emu-ahead-of-real difference occurs the closer the input read is to the bottom edge.

Setting specific math variables to avoid confusion:
-- Stopwatch start for this one is the beginning of the visual refresh cycle (the moment scanline #1 begins)
-- Stopwatch stop for this one is the input read & visual action
-- Original 8-bit machines can react during (or below) the scanline where the input read occurred, e.g. doing the input read in the scanline right above the pinball flipper sprites
-- Real time input reads mid-scanout, with real raster racing tightly behind emulator raster
-- For math simplicity, temporarily ignore display lag (instant GtG) and frameslice granularity (e.g. theoretical 1-scanline-tall frameslices or frontbuffer beam racing)

For a realtime input read at the exact middle of an emulator refresh cycle (1/2 of 1/60sec):
-- Original on 60Hz has already displayed half of the refresh cycle to human eyes in 1/120sec (halftime of 60Hz scan)
-- Emulator on 240Hz has already displayed half of the refresh cycle to human eyes in 1/480sec (halftime of 240Hz scan)
-- Assuming (A) beam racing, and (B) real time input reads during beam racing
-- Your lag savings is the mathematical difference between 1/480sec and 1/120sec: 3/480sec less lag than the original machine under this lag stopwatching criteria (worked out below).
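
Worked out explicitly:
Code:
half of a 60 Hz scanout:   (1/60)/2  = 1/120 sec = ~8.33 ms
half of a 240 Hz scanout:  (1/240)/2 = 1/480 sec = ~2.08 ms
lag savings:               1/120 - 1/480 = 3/480 sec = 6.25 ms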

Our confusion may simply be different lag stopwatching criteria (start/end), and there are many ways to choose them (audio stimuli, visual stimuli, lag differential of concurrently executing machines, etc). If you take a theoretical high speed video of two machines running exactly the same software at exactly the same time (1 real device, 1 emulator), with the same refresh-begin intervals, then the screen reaction to an input read occurs much sooner on the emulated machine for bottom-edge scanout. If you were thinking of a different lag stopwatching criteria, then you may very well be correct, and our confusion is resolved.

Indeed, because emulator audio is necessarily buffered during fast-scanout, lag-stopwatching via audio necessarily differs from visual stimuli. I am a deafie, so I'm a visual-stimuli guy myself... And even isolating to visuals, there's more than one legitimate way to lag-stopwatch visual stimuli, such as measuring from the scanline of the screen reaction rather than from scanline #1. The screen reaction (e.g. player action) can occur in a vertically different part of the screen than the trigger (e.g. the danger/obstacle you react to), so it can benefit or handicap depending on whether one is above or below the other. Some lag stopwatching criteria only show lag advantages in certain games, while others show lag advantages in different games or apps (e.g. drawing programs -- drawing lag). There are specific cases where a faster scanout may handicap, e.g. giving you less time to react, but that's not universal; it depends on the layout of onscreen activity and input reads. Unfortunately, no single lag-stopwatching measurement method fits all.

Have I explained this better? Or have I made an error? (Unlikely.)... I should have posted a diagram instead of writing so many words, eh? Some old Blur Busters scan diagrams show 144fps at 144Hz GSYNC versus 100fps at 144Hz GSYNC, demonstrating the behaviour of two different frame rates on the same 144 Hz GSYNC monitor. Upon request, I can create new diagrams demonstrating the lag savings if preferred. People sometimes like my diagrams and TestUFO animations more than my words!

Quote:
Originally Posted by ReadOnlyCat
If that system works, it is by definition not realistic in any way since the original machine did not behave like that.
Indeed unnatural, but regardless, it's lower lag than any other possible emulator lag-reducing method (input delay, hard GPU sync, VRR-only benefit, etc). Many tricks are combinable -- beam racing VRR refresh cycles apparently works!

Quote:
Originally Posted by ReadOnlyCat
(*) in the analog domain, there is some evidence that they can at least detect (as opposed to respond to) very fine lag (millisecond) but this is likely an indirect detection made possible by the easier predictability of the response to analog movements.
<offtopic but interesting>
Quite possibly indeed! Another factor: even when a human cannot feel the millisecond, there's also the "cross-the-finish-line" effect: the "see each other, react at the same time, frag each other at the same time" situation. With near-identical human reaction times, the input lag of the equipment can be the deciding factor in a specific reaction-time win. The "my reaction time seems better with this lower-lag setup" effect is powerful even if one can't always feel the millisecond or few directly. When competing in professional leagues, the reaction time spread is tighter, so tiny lag differences matter more. This affects statistical wins in their favour.
</offtopic but interesting>

Last edited by mdrejhon; 10 April 2018 at 20:43.
Old 14 April 2018, 10:18   #215
mark_k
Registered User
 
Join Date: Aug 2004
Location:
Posts: 3,333
Does WinUAE's current lagless vsync code require D3DKMTGetScanLine()? According to MSDN that only works on Windows Vista and later. For Direct3D 9, could GetRasterStatus() be used instead? That could allow lagless vsync to be XP-compatible.
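
(For illustration, the Direct3D 9 polling is a single call; a minimal sketch, where d3d9Device and the result variables are placeholders:)
Code:
// XP-compatible raster polling via IDirect3DDevice9::GetRasterStatus().
D3DRASTER_STATUS rs;
if (SUCCEEDED(d3d9Device->GetRasterStatus(0, &rs)))
{
    // rs.InVBlank is TRUE inside vertical blanking; otherwise
    // rs.ScanLine is the current scan line (0 = first visible line).
    inVBlank    = rs.InVBlank != FALSE;
    currentLine = rs.ScanLine;
}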


Just for completeness and in case any other emulator devs are reading this thread, it could be possible to support lagless vsync in DirectDraw too, via GetVerticalBlankStatus() and GetScanLine().

GetScanLine() could be useful since the number it returns includes lines in the vertical blanking interval. MSDN says "The returned scan line value is in the range from 0 through n, where 0 is the first visible scan line on the screen and n is the last visible scan line, plus any scan lines that occur during the vertical blank period. So, in a case where an application is running at a resolution of 640×480 and there are 12 scan lines during vblank, the values returned by this method range from 0 through 491."
Old 14 April 2018, 11:33   #216
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
It isn't that simple. QueryDisplayConfig() is Vista+ only and it is needed to find out the horizontal scan rate and total line count. Without it, only 1:1 modes can work (no GPU scaling).
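
(For reference, a minimal sketch of that query; the signal-info fields are what supply the ActiveHeight/TotalHeight/HFreq values seen in the log earlier:)
Code:
// Query the actual video signal timing (Vista+).
UINT32 numPaths = 0, numModes = 0;
GetDisplayConfigBufferSizes(QDC_ONLY_ACTIVE_PATHS, &numPaths, &numModes);

std::vector<DISPLAYCONFIG_PATH_INFO> paths(numPaths);
std::vector<DISPLAYCONFIG_MODE_INFO> modes(numModes);
QueryDisplayConfig(QDC_ONLY_ACTIVE_PATHS, &numPaths, paths.data(),
                   &numModes, modes.data(), nullptr);

for (const auto& m : modes)
    if (m.infoType == DISPLAYCONFIG_MODE_INFO_TYPE_TARGET)
    {
        const DISPLAYCONFIG_VIDEO_SIGNAL_INFO& sig = m.targetMode.targetVideoSignalInfo;
        // sig.activeSize.cy = visible scan lines, sig.totalSize.cy = total lines,
        // sig.hSyncFreq / sig.vSyncFreq = horizontal/vertical rates as rationals.
    }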
Old 15 April 2018, 15:04   #217
mark_k
Registered User
 
Join Date: Aug 2004
Location:
Posts: 3,333
I wonder if some kind of calibration could be feasible, like the old low-latency vsync calibration? Loop calling GetRasterStatus() until the returned scanline# wraps back to 0, to determine the number of active lines. Use QueryPerformanceCounter() or RDTSC to time the active period (divide by the number of active lines to get the scan rate). Time how long the returned scanline value remains 0 to determine how long vertical blanking is (and hence calculate the number of scanlines in vblank).
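
(A sketch of that calibration loop; illustrative only, with no error handling:)
Code:
// Time the active scanout period with QueryPerformanceCounter while polling
// GetRasterStatus(), then derive the approximate per-scanline time.
LARGE_INTEGER freq, tTop, tBottom;
QueryPerformanceFrequency(&freq);

D3DRASTER_STATUS rs = {};
// Wait until the raster wraps back to line 0 (start of active scanout).
do { d3d9Device->GetRasterStatus(0, &rs); } while (rs.ScanLine != 0 || rs.InVBlank);
QueryPerformanceCounter(&tTop);

// Poll until vblank begins, remembering the highest scanline seen.
UINT activeLines = 0;
do {
    d3d9Device->GetRasterStatus(0, &rs);
    if (rs.ScanLine > activeLines) activeLines = rs.ScanLine;
} while (!rs.InVBlank);
QueryPerformanceCounter(&tBottom);

double activeSecs = double(tBottom.QuadPart - tTop.QuadPart) / double(freq.QuadPart);
double lineTime   = activeSecs / activeLines; // approximate time per scanline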

How often does WinUAE currently call D3DKMTGetScanLine()? Is there much overhead in calling that routine? Maybe the overhead is driver-dependent, and some hardware configs could work better being manually timed?
Old 15 April 2018, 15:07   #218
mark_k
Registered User
 
Join Date: Aug 2004
Location:
Posts: 3,333
Talking again about DirectDraw, that could be a good way to get lagless vsync working on low-end/older systems running 98/ME/XP (or Vista/7 with DWM disabled). Maybe the RetroArch guys could investigate that, since they already back-ported to Windows 98. I think you can write directly to the framebuffer, so an emulator could either do that (writing one slice at a time), or render a slice then stretch-blit that small region to the framebuffer. Little or no memory bandwidth wasted.
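
(For illustration, the DirectDraw calls involved; a sketch only, with error handling omitted:)
Code:
// Poll the raster, then write one slice directly into the primary surface.
DWORD scanline = 0;
ddraw7->GetScanLine(&scanline); // current raster line (includes vblank range)

DDSURFACEDESC2 desc = {};
desc.dwSize = sizeof(desc);
if (SUCCEEDED(primary->Lock(NULL, &desc, DDLOCK_WAIT | DDLOCK_WRITEONLY, NULL)))
{
    // Write the slice's pixels via desc.lpSurface, one row per desc.lPitch bytes.
    primary->Unlock(NULL);
}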
Old 15 April 2018, 19:24   #219
rare_j
Zone Friend
 
Join Date: Apr 2005
Location: London
Posts: 1,176
As someone who uses WinUAE on Windows XP, I would not expect to see lagless vsync as an option. I'm grateful enough that WinUAE runs at all on XP.
Anyway, it's doubtful a Windows XP system would have a high enough spec to do lagless vsync properly.
Old 16 April 2018, 04:48   #220
mdrejhon
Chief Blur Buster
 
Join Date: Mar 2013
Location: Toronto, Canada
Posts: 40
Quote:
Originally Posted by mark_k
Talking again about DirectDraw, that could be a good way to get lagless vsync working on low-end/older systems running 98/ME/XP (or Vista/7 with DWM disabled).
I have come up with a way to accurately predict the scan line number (to an accuracy of ~0.1 - 0.2%) without GetScanLine.

I only need to know the timestamps of VSYNC events, and thus I can beam-race any platform that supports tearlines (Linux, Mac, Android, PC) as long as I have access to a high-precision counter such as QPC, RDTSC or std::chrono::high_resolution_clock (et cetera). Though a hook may be needed to access precision counters (e.g. RDTSC), lagless VSYNC can optionally be brought to older-spec Windows systems with sufficient frameslice throughput (240 frameslices per second at 4 frameslices per refresh cycle).

Even a 10%-20% error in scanline prediction is still good enough for 10-frameslice beam racing with a 2-frameslice trailbehind (2/10ths of a refresh cycle of extra jitter margin). Knowing the VBI size helps and allows you to tighten the beamrace margin, but it is only cake frosting, as I easily get to less than 2-3% error without knowing the VBI size. VBI knowledge simply improves the raster guess to ~0.1-0.2% with VSYNC timestamps.

I've gotten my cross-platform beam racing demo far enough along to run on 2 separate platforms with exactly the same code. I think this is probably a world's first. Cross platform beam racing is a reality with my code!

Zero need for a direct ScanLine register if I have access to 90% of VBI timestamps (I dejitter timestamp inaccuracies & extrapolate missed VBIs). I provide a simulated .ScanLine from my RasterCalculator class, and it keeps incrementing into the VBI too (so it raster-guesses the scan line number in the VBI too!).
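
(A hypothetical sketch of that kind of estimate -- not the actual RasterCalculator code -- deriving a scan line number purely from dejittered VSYNC timestamps:)
Code:
#include <cmath>

struct RasterEstimate {
    double refreshPeriod;  // seconds, averaged over recent VSYNC intervals
    double lastVsyncTime;  // seconds, timestamp of the most recent VSYNC
    int    totalScanlines; // active + VBI lines (estimated or queried)

    // Estimated scan line at time 'now'; keeps counting through the VBI.
    int scanLine(double now) const {
        double elapsed = std::fmod(now - lastVsyncTime, refreshPeriod);
        return (int)(elapsed / refreshPeriod * totalScanlines);
    }
};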

You *can* provide an optional platform-specific hardware ScanLine hook into my module, but it's completely optional. 10-frameslice beam racing (like WinUAE) only needs a ~10-20% error margin, and I'm getting all the way to 0.1% error with completely software-based raster guesses, as simple precision offsets from VSYNC timestamps.

I'm using C# .NET + MonoGame for this cross-platform beam racing demo. Got it working on Mac and on Windows so far. MonoGame uses OpenGL on Mac and Direct3D on Windows. But the API doesn't matter; you can use whatever framework you want. It simply must support tearlines.

As long as it has tearlines, it is beamraceable, and thusly "lagless VSYNC" is possible.
Repeat after me, "VSYNC OFF tearlines are just rasters!"

Yes, my cross-platform beam racing demo is going to be open source. My cross-platform RasterCalculator & VsyncCalculator classes will be the useful modules for cross-platform "Lagless VSYNC" for other emulator authors (once ported to C/C++).

Stay tuned.

Last edited by mdrejhon; 17 April 2018 at 18:24.