Old 04 April 2018, 22:10   #201
mark_k
Registered User
 
Join Date: Aug 2004
Location:
Posts: 3,333
Something I noticed when experimenting with full-window mode... I wanted to reduce Windows desktop resolution to (hopefully) allow more slices to be used. Monitor native res is 3840×2160 so I wanted to set desktop to 1920×1080.

This is on Windows 10 1709 with an AMD graphics card. Windows is very keen on scaling its desktop to the monitor's native res. If you go into Windows display settings and change the resolution shown in the drop-down box from 3840 × 2160 to 1920 × 1080, the actual video signal output is still 3840×2160. That's even though I had disabled the GPU scaling option in Radeon settings.

In order to get a real 1920×1080 display as confirmed by monitor OSD info, I had to click Display adapter properties, then List All Modes, then choose 1920×1080 from the list there.

Anyway... Without setting a real 1920×1080 display mode (just changing res in Windows display settings), WinUAE's timing is messed up. It seems to detect the actual video signal being output and bases timings on that, whereas each of the 1080 scanlines is actually being output twice (more or less).

I'm not sure how good results would be with correct timings in that case, but here's some log output:
Code:
D3D11 init start. (1920*1080) (1920*1080) RTG=0 Depth=32.
CheckFeatureSupport(DXGI_FEATURE_PRESENT_ALLOW_TEARING) = 00000000 1
D3D11 found matching refresh rate 60/1=60.00. SLO=1
D3D11 Device: AMD FirePro W5100 [\\.\DISPLAY1] (0,0,1920,1080)
D3D11CreateDevice succeeded with level 11.1. Hardware accelerated.
D3D11 2 00000800 00000057
D3D11_resize 0 0 0 (0)
D3D11 init end
D3D11 resize do
D3D11_resize -> none
D3D11 resizemode start
D3D11 freed3d start
D3D11 freed3d end
D3D11 resizemode 1920x1080, 1920x1080 2 00000800 FS=-1
D3D11 initd3d start
-> -0.000000 0.000000 0.000000 -0.000000 0.000000 0.000000
D3D11 initd3d end
D3D11 1920x1080 main texture allocated
POS (0 0 1920 1080) - (0 0 480 270)[480,270] (720 405) S=1920*1080 B=1920*1080
0 0 4 4
-> -3840.000000 2160.000000 3840.000000 -2160.000000 4.000000 4.000000
D3D11 resizemode end
D3D11 resize exit
D3D11 376x288 main texture allocated
POS (0 0 1920 1080) - (-772 -396 -292 -126)[480,270] (720 405) S=1920*1080 B=376*288
0 0 4 4
-> -752.000000 576.000000 752.000000 -576.000000 4.000000 4.000000
Buffer size (376*288) Native
ActiveHeight: 2160 TotalHeight: 2222 VFreq=60/1=60.00Hz HFreq=533250000/4000=133.313KHz
Spincount = 17331
...
ActiveHeight: 2160 TotalHeight: 2222 VFreq=60/1=60.00Hz HFreq=533250000/4000=133.313KHz
PAL mode V=60.0000Hz H=15625.0879Hz (227x312+0) IDX=10 (PAL) D=0 RTG=0/0
D3D11 376x287 main texture allocated
POS (0 0 1920 1080) - (-772 -396 -292 -126)[480,270] (720 405) S=1920*1080 B=376*287
0 -1 4 4
-> -752.000000 580.000000 752.000000 -568.000000 4.000000 4.000000
Buffer size (376*287) Native
RTGFREQ: 312*60.0000 = 18720.0000 / 60.0 = 312
D3D11 Shader and extra textures restored
Old 05 April 2018, 02:03   #202
mdrejhon
Chief Blur Buster
 
Join Date: Mar 2013
Location: Toronto, Canada
Posts: 40
Quote:
Originally Posted by mark_k
DXGI 1.2 supports partial presentation, see Enhancing presentation with the flip model, dirty rectangles, and scrolled areas on MSDN. With that you'd use DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL or DXGI_SWAP_EFFECT_SEQUENTIAL and specify the dirty rectangle to be the just-rendered strip. (DXGI_SWAP_EFFECT_DISCARD might work as well, if Windows allows it. Since we don't care what's in the rest of the display, just the current strip.)

Also, could partial texture updates help? Use ID3D11DeviceContext::UpdateSubresource() to partially update the texture? Or maybe use multiple textures, one for each strip. Before each Present() you'd only update one texture.

Is there any way to visualise how much GPU bandwidth is being used, in real time? (Preferably something that can co-exist on-screen with WinUAE.)
Interesting tip. I'll have to research whether this can increase frameslice throughput enough to achieve single-scanline-height frameslices.
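
For reference, here is a minimal sketch of the dirty-rectangle presentation idea quoted above, assuming a DXGI 1.2 flip-model swap chain (swapChain1, stripTop, stripBottom and screenWidth are illustrative placeholders):
Code:
// Present only the just-rendered strip as a dirty rectangle (DXGI 1.2+).
// swapChain1 is an IDXGISwapChain1 created with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL.
RECT dirty = { 0, stripTop, screenWidth, stripBottom };

DXGI_PRESENT_PARAMETERS params = {};
params.DirtyRectsCount = 1;    // only this region changed since the last Present
params.pDirtyRects = &dirty;   // the OS/driver may copy just this strip

// SyncInterval 0 = VSYNC OFF, needed for beam-raced tearline control.
HRESULT hr = swapChain1->Present1(0, 0, &params);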

At 7000 frameslices per second at 2560x1440 on my computer, that's up to 77 gigabytes per second of memory throughput (24-bit framebuffers being blitted repeatedly over and over). Memory bandwidth is the bottleneck for VSYNC OFF based beam racing. But that can apparently be avoided (and frameslice throughput increased too!).
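
(For reference, the arithmetic behind that figure:)
Code:
2560 x 1440 pixels x 3 bytes    = ~11 MB per full-frame blit
~11 MB x 7000 blits per second  = ~77 GB/s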

I've had success avoiding that memory bandwidth (to an extent) simply by not pre-clearing the framebuffer before rendering into it. VSYNC OFF still internally uses 2 different frame buffers, alternating between the two when flipped.

According to my tests, when I leave the buffers uncleared and keep flipping the two, the pre-existing buffer contents alternate between the tearlines -- the front buffer becomes my back buffer, and the back buffer becomes the front buffer.

They retain their original content, so this is a bandwidth-saver piggyback opportunity. I can see it because if I flip back and forth without doing anything, it simply flashes between the most recent 2 frames rendered. This is a beneficial behaviour that can be piggybacked upon for more memory-bandwidth-optimized VSYNC OFF frameslice beamracing.

To make beam racing compatible with that, one could skip preclearing the buffers and blit only the most recent two emulator frameslices. That'll cover both of the framebuffers. But there's no guarantee that all VSYNC OFF implementations use this "back<->front" buffer-trading algorithm. Probably best done as a configuration option for a "memory-bandwidth-reducing" setting. It would help make beam racing more compatible with laptop GPUs running off non-VRAM.
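
A rough sketch of that idea, assuming the driver really does trade two persistent buffers on flip as observed above (blitSlice() and swapChain are hypothetical placeholders):
Code:
// Patch both alternating buffers instead of clearing them (sketch only).
void presentSlice(int sliceIdx)
{
    // The buffer we are about to draw into last showed frame N-2's content,
    // so it is missing the previous slice as well as the current one.
    blitSlice(sliceIdx - 1);  // re-blit the previous slice into the stale buffer
    blitSlice(sliceIdx);      // blit the freshly emulated slice
    swapChain->Present(0, 0); // VSYNC OFF flip; the two buffers trade places
}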

Shaders/fuzzylines/HLSL type stuff will have to be modified to be bleed-aware to prevent introducing artifacts at frameslice boundaries, but could still piggyback off this memory-bandwidth-saver trick.

Ideally we still need to reuse the whole emulator frame buffer if we want a full refresh cycle of jitter safety margin (i.e. variable distance between emulator raster & real raster).

Last edited by mdrejhon; 05 April 2018 at 04:18.
Old 05 April 2018, 13:04   #203
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
Quote:
Originally Posted by mark_k
ActiveHeight: 2160 TotalHeight: 2222 VFreq=60/1=60.00Hz HFreq=533250000/4000=133.313KHz
This means the mode is still scaled (2160 pixel height). Or someone is lying.

Quote:
Could there be any scope for reducing GPU memory bandwidth as this feature is refined? That could allow lower-end systems to benefit from this feature, and higher-end ones to use more slices.
Perhaps, but I am not going to do anything complex, because the easiest solution is the usual one: buy a better PC. A single slice is still much better than normal mode.

Quote:
DXGI 1.2 supports partial presentation, see
Isn't that more to do with non-flip mode where copying is needed? I don't see how partial presentation can help with flipping.

Windowed mode (where flipping can't work) needs a variable sync monitor anyway, because no one uses a 50Hz desktop = can't be too old a PC. (100Hz also means a not-too-old PC.)

Partial texture copy can be done but it also won't help much because Amiga internal resolution is relatively small (768*568 or so in normal configuration).

I'll test if not clearing helps.
Old 05 April 2018, 14:05   #204
mark_k
Registered User
 
Join Date: Aug 2004
Location:
Posts: 3,333
Quote:
Originally Posted by Toni Wilen
This means mode is still scaled (2160 pixel height). Or someone is lying.
Right, that log was from the case where Windows desktop is 1920×1080 but Windows (or GPU driver) scales it to 3840×2160 for output. You can detect cases like that since the full-window pixel height doesn't match the detected display height. Typically the scale factor would be a small integer or "nice" fraction (e.g. 1080⇒1440 is 4/3).
Quote:
Isn't that more to do with non-flip mode where copying is needed? I don't see how partial presentation can help with flipping.
The MSDN page puts more emphasis on flip being the recommended way. That is, DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL being preferable to DXGI_SWAP_EFFECT_SEQUENTIAL. In connection with DWM it says "By using flip model, back buffers are flipped between the runtime and Desktop Window Manager (DWM), so DWM always composes directly from the back buffer instead of copying the back buffer content."

I assume that when composing each frame, the graphics driver can take the previous frame's image (in GPU memory) and just copy the small changed part to it, rather than copying the entire image from PC system memory?

Quote:
Windowed mode (where flipping can't work) needs a variable sync monitor anyway, because no one uses a 50Hz desktop = can't be too old a PC. (100Hz also means a not-too-old PC.)
It is possible in most cases to set the desktop refresh rate to 50Hz. [But some monitors duplicate or drop frames so as to always refresh the panel at 60Hz... so it's kind of pointless there.]

Did you mention before being able to get tearing in windowed mode with an Nvidia graphics card? (On my AMD card I can get tearing in full-window and full-screen modes but not windowed.)
Old 05 April 2018, 18:39   #205
mdrejhon
Chief Blur Buster
 
Join Date: Mar 2013
Location: Toronto, Canada
Posts: 40
Quote:
Originally Posted by Toni Wilen
Partial texture copy can be done but it also won't help much because Amiga internal resolution is relatively small (768*568 or so in normal configuration).
But what about the final buffer after full shader effects are applied -- that makes it a much higher resolution frame buffer. Those are rather big framebuffers, aren't they? (The shader would obviously need to be slice-compatible, though)

(I will have to do benchmark tests to see whether it improves things much or not... lemme see)
Old 05 April 2018, 20:23   #206
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
CopySubresourceRegion() is used now. I didn't see anything different without ClearRenderTargetView() (clearing is probably very optimized in GPUs because it is a common operation). It is not disabled because that would also remove the debug colors.
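
(For illustration, a strip-sized copy with that call might look like this; the variable names are hypothetical:)
Code:
// Copy a freshly emulated strip-sized source texture into the main texture.
// stripTop is the destination Y offset; width/stripHeight describe the strip.
D3D11_BOX box = { 0, 0, 0, width, stripHeight, 1 }; // left, top, front, right, bottom, back
context->CopySubresourceRegion(mainTex, 0,         // dest resource, subresource
                               0, stripTop, 0,     // dest X, Y, Z
                               stripTex, 0, &box); // source resource, subresource, region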

Quote:
Originally Posted by mdrejhon
But what about the final buffer after full shader effects are applied -- that makes it a much higher resolution frame buffer. Those are rather big framebuffers, aren't they? (The shader would obviously need to be slice-compatible, though)
That's the big one. Many shaders are "non-linear" vertically; not sure if it can be done without adding a large safety margin.
Old 05 April 2018, 20:54   #207
mdrejhon
Chief Blur Buster
 
Join Date: Mar 2013
Location: Toronto, Canada
Posts: 40
Quote:
Originally Posted by Toni Wilen
That's the big one. Many shaders are "non-linear" vertically; not sure if it can be done without adding a large safety margin.
Agreed, the shader would have to be frameslice-aware. Not easy programming.

That said, due to scanout, each frameslice is a lag gradient itself:
-- 1 frameslice: [0...1.00/60sec lag] = 16.7ms difference from top edge to bottom edge of each frameslice
-- 2 frameslices: [0...0.50/60sec lag] = 8.33ms difference from top edge to bottom edge of each frameslice
-- 3 frameslices: [0...0.33/60sec lag] = 5.55ms difference from top edge to bottom edge of each frameslice
-- 4 frameslices: [0...0.25/60sec lag] = 4.17ms difference from top edge to bottom edge of each frameslice
-- 10 frameslices: [0...0.10/60sec lag] = 1.67ms difference from top edge to bottom edge of each frameslice
(Lag granularity / lag gradients become finer, the more frameslices...)

Thusly:
10 frameslices + 108-line offset (2-frameslice lag instead of 1-frameslice lag) = (1.67ms lag + 1.67ms lag) = 3.33ms lag in the emuraster-vs-realraster chase margin.

So a 108 line offset (between emu-raster and real-raster) isn't too bad, to compensate for things like curved-simulated CRTs and ambient bleed/fuzz effects. Like halos around bright parts. Long term, I think the chase distance between emuraster + realraster should probably be a slider adjustment while watching a horizontal-panning motion test, to observe for glitches/tearing artifacts.

Nonetheless, it can be a low-priority "nice-to-have". Most modern NVIDIA/AMD GPUs from the last 5 years are powerful enough to just do a full shader re-render 4 times a refresh cycle. Even 4 frameslices is still one-quarter refresh cycle latency (a 75% lag reduction). Sub-frame rendering via real-time beam racing + shaders enabled = still beautifully viable with full shader re-renders. Pinball has never been this much fun in an emulator before today!

Last edited by mdrejhon; 06 April 2018 at 19:50.
Old 07 April 2018, 22:18   #208
mark_k
Registered User
 
Join Date: Aug 2004
Location:
Posts: 3,333
Perhaps another thing to try: instead of a full-size Amiga texture (752×576 or whatever) partially updated, have a smaller texture corresponding to one strip. That would be mirrored over the output surface (i.e. the whole screen) with D3D11_TEXTURE_ADDRESS_MIRROR. Hopefully the mirroring would mean that shaders which sample from rows above/below each strip don't look too bad.
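
A minimal sketch of such a sampler (D3D11; the mirror addressing on V is the point here):
Code:
// Sampler that mirrors the strip texture vertically.
D3D11_SAMPLER_DESC sd = {};
sd.Filter = D3D11_FILTER_MIN_MAG_MIP_LINEAR;
sd.AddressU = D3D11_TEXTURE_ADDRESS_CLAMP;
sd.AddressV = D3D11_TEXTURE_ADDRESS_MIRROR; // rows above/below reflect the strip
sd.AddressW = D3D11_TEXTURE_ADDRESS_CLAMP;
sd.ComparisonFunc = D3D11_COMPARISON_NEVER;
sd.MaxLOD = D3D11_FLOAT32_MAX;

ID3D11SamplerState* mirrorSampler = nullptr;
device->CreateSamplerState(&sd, &mirrorSampler);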

Maybe multi-threaded Present() could also help? [Assuming you don't do that already.] So have two thin strip-size textures. Emulate/render to one, tell the other thread to Present(), then emulate/render into the other texture, tell the other thread to Present(), etc.

For shaders, could modifying them to do something like this help reduce memory bandwidth?
Code:
// Early-out for pixels outside the current strip (HLSL-style sketch).
if (current_line < strip_top || current_line > strip_bottom)
    return float4(0, 0, 0, 1); // black
else
    return normal_result;      // whatever the shader does normally
Whether there's any performance difference, and how much, would vary depending on OS and graphics driver/hardware I guess.
Old 08 April 2018, 09:58   #209
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
Having only Present() in a separate thread won't work unless you guarantee no D3D calls are made by other threads when calling Present(). It also does not gain anything, because normally Present() returns immediately. If it does not, it means the previous rendering was not ready and the present gets delayed, possibly causing glitches (= too slow CPU or GPU). Threading won't fix that.

The default shader is very simple; it just does the minimum needed to set the pixel color (like a pre-shader hardwired rasterizer would do). Comparisons would only make it slower.

I am asking again: is this really worth the trouble? No one has mentioned any major slowdowns with ~3-4 slices. Let's first see what happens when the first official beta is released.
Old 08 April 2018, 20:00   #210
Retro-Nerd
Missile Command Champion
 
Join Date: Aug 2005
Location: Germany
Age: 52
Posts: 12,435
Quote:
- Input is read after each slice.
- Default is 4 slices which makes max input latency about 6ms. (Assuming quality "gaming grade" USB input devices that use <=1ms USB rate.)
What about the input lag of the monitors used? That adds a bit to the total lag. So even with the best Acer Predator it's approx. 9ms then, right? Or did I miss something?

Last edited by Retro-Nerd; 08 April 2018 at 21:36.
Old 09 April 2018, 05:22   #211
mdrejhon
Chief Blur Buster
 
Join Date: Mar 2013
Location: Toronto, Canada
Posts: 40
Quote:
Originally Posted by Retro-Nerd
What about the input lag of the monitors used? That adds a bit to the total lag. So even with the best Acer Predator it's approx. 9ms then, right? Or did I miss something?
Yes, monitor lag is a factor, but most monitor lag tests use a "VSYNC ON" lag tester, so the numbers need to be interpreted differently.

9ms on DisplayLag.com = 2ms using beam racing

Click here for technical explanation

Most lag test websites measure using a "Leo Bodnar Lag Tester" device, which is a 60 Hz VSYNC ON lag tester whose lag stopwatch runs from VBI to screen-middle. Half of 1/60sec is (16.7ms)/2 = 8.3ms... so even a 60 Hz CRT shows 8.3ms lag on the CENTER square with a Leo Bodnar Lag Tester, simply because it is a 60 Hz VSYNC ON lag tester.

The lag you get with VSYNC OFF or beam racing (tearingless VSYNC OFF) is up to half a refresh cycle less than the CENTER square of a Leo Bodnar Lag Tester. With beam racing, all squares equalize in lag: the screen-CENTER and BOTTOM-edge tests have the same lag as the TOP edge. TOP (~2-3ms) == CENTER (~2-3ms) == BOTTOM (~2-3ms) on the current best low-lag LCD gaming monitors that do real-time synchronous scanout (cable scanout = panel scanout). That remaining lag is GtG overhead: several modern gaming LCD monitors have no monitor-side framebuffer delay (they do line-buffered processing) and actually scan the video signal in real time nearly straight onto the LCD panel. I've confirmed this via high speed camera.

TL;DR: That particular Acer Predator is only ~1-2ms slower than the CRT via the Leo Bodnar lag test stopwatching criteria.

I'm the inventor of some lag testing techniques too, as I am the world's first person to successfully test the input lag of GSYNC (I wrote that in 2013!).
And I've met the people behind other lag testing websites (including DisplayLag.com). So by now, you know I'm a known expert in display lag behaviours!


Last edited by mdrejhon; 09 April 2018 at 05:46.
Old 09 April 2018, 11:25   #212
Retro-Nerd
Missile Command Champion
 
Join Date: Aug 2005
Location: Germany
Age: 52
Posts: 12,435
Thanks for the explanation. This makes the new method even more impressive.
Old 10 April 2018, 08:44   #213
ReadOnlyCat
Code Kitten
 
Join Date: Aug 2015
Location: Montreal/Canadia
Age: 52
Posts: 1,178
Quote:
Originally Posted by mdrejhon
Fellow software developer....

There's no buffer waiting in this new version even for mid-screen input reads.
It's sub-frame latency even for mid-screen input reads!!
Toni just successfully implemented tearingless VSYNC OFF

[snip, an unbearably long wall of text]
Fellow kitten, thank you for the detailed reply.

First of all, I want to make clear that I understand what the method is about, as well as all the subtleties of beam racing, page flips / present, vsync, how your method works and how Toni implemented it: producing pixels synchronously with the display refresh has been the dream of every video game coder since before the Amiga existed, so this is not a new concept for this kitten.

Second, you may not have noticed, but you have a tendency to repeat yourself several times in your posts, which makes them very difficult to read. I really recommend that you take the time to eliminate redundant information and jargon, and to simplify.
I cannot blame you, I was like you before, but believe me: the less you say, the more people understand and listen.

Now onto my main point:

Quote:
It's like bufferless VSYNC ON.
Tearlines are simply just rasters. We're simply raster-timing the tearlines out of the way
Toni just successfully implemented raster-synchronized VSYNC OFF.
Tearing never appears
I was talking about hypothetical tearlines which *would* appear *if* one started to display the next frame mid-raster, which is not what your method is about. I was not actually implying that your method produces tearlines.

But this is actually secondary, so I will not delve into this point.

Quote:
To genuinely get less lag than the original machine, you simply "cheat" by doing surge-execute cycles followed by surge-scanout cycles. For the average joe user, nothing amiss is noticed if the frames output at their regular rate -- but the emulator can cheat via accelerated beam-racing an accelerated-scanout display. Basically running Amiga CPU 4x faster while real-time streaming pixels out to a 4x-faster-scanout display ("60Hz" refresh cycles scanned out in 1/240sec ... with a pause of 3/240sec before doing the next surge-execute). It's now possible with the latest WinUAE beta, I just helped Toni make beam racing VRR compatible...
There are simpler ways to say the same thing. I had to parse this sentence several times to understand what you actually meant just because of the surge-scanout-stuff keywords. They sound nice but they hide the simplicity of the method.

To rephrase it (correct me if I am wrong):

Since the display refresh rate is 4x that of the emulated machine's and the emulator can also emulate 4x faster, we emulate the current frame in 1/240s, and we immediately present the resulting frame buffer to the 240Hz display, then we keep that frame constant for 3/240s.

Essentially what this does is compress the Amiga frame execution time to 1/4th of its duration and then idle until the next frame needs to be executed. All while keeping the rate of visual frames output at a steady, non-jittering 60Hz.

Quote:
That shortens the time between input read and pixels on the screen, no matter where the input read is -- as long as the input read is somewhere further up from the bottom of the screen. If the input read is during the VBI, there is up to 1/60sec less lag than the original machine in the "infinite-fast-surge-execute" situation. If the input read is during the center-of-screen, there is 0.5/60sec less lag than the original machine.
That is not correct.

This shortens the time between input changes and pixel changes for all inputs occurring in the top 1/4th of the Amiga frame.
For all inputs which occur after that first 1/4th of a frame, it introduces an additional latency of one full 60Hz frame, since they will be taken into account at the next 60Hz frame.

This does not reduce latency at all, but gives it a bias toward the beginning of the (Amiga) frame, which is very different. Essentially this moves the input scanning point to the top of the Amiga frame rather than, for example, mid-screen.

In order to claim that this is a benefit, you must assume that most inputs occur in the top 1/4th of the screen of an Amiga frame, which is frankly dubious given that inputs are equally likely to occur at any moment during a frame.
Humans are not precise enough to time an action within the top quarter of a 60Hz frame, especially in the digital domain (*).

And this is especially true given that CRT images are painted progressively, so the most important elements of an image are not even visible at the top of the screen: even if visual-to-hand reflexes had sub-frame granularity, they would most likely occur at mid-screen anyway, when the information is available.

Quote:
Now, we can combine VRR and beam racing for realistic "less-than-original" lag.
If that system works, it is by definition not realistic in any way since the original machine did not behave like that.

Quote:
P.S. I'm located in Canada (Hamilton, specifically). So you code in Montreal?
My employer indeed pays me to do so.


That is all for today, thanks for bearing with me. (ˆˆ)/

(*) in the analog domain, there is some evidence that they can at least detect (as opposed to respond to) very fine lag (millisecond) but this is likely an indirect detection made possible by the easier predictability of the response to analog movements.
Old 10 April 2018, 18:46   #214
mdrejhon
Chief Blur Buster
 
Join Date: Mar 2013
Location: Toronto, Canada
Posts: 40
Quote:
Originally Posted by ReadOnlyCat
you may not have noticed but you have a tendency to repeat yourself several times in your posts which makes them very difficult to read. I really recommend that you take your time eliminating redundant information and jargon and simplifying.
I'm actually aware that I am unnecessarily long sometimes. Noted.
Sometimes my long writing style works better in other contexts (e.g. my 480Hz tests and the animated 1000Hz Journey articles) but in these forum contexts I need to remember to adjust, especially since forum posts are not proofread/diagrammed/animated like my articles. It's my style, but I'll work to reply shorter to your quotes now.

Quote:
Originally Posted by ReadOnlyCat
Since the display refresh rate is 4x that of the emulated machine's and the emulator can also emulate 4x faster, we emulate the current frame in 1/240s, and we immediately present the resulting frame buffer to the 240Hz display, then we keep that frame constant for 3/240s. Essentially what this does is compress the Amiga frame execution time to 1/4th of its duration and then idle until the next frame needs to be executed. All while keeping the rate of visual frames output at a steady, non-jittering 60Hz.
Yes, yes. You are correct here.

Quote:
Originally Posted by ReadOnlyCat
That is not correct.
I spent 20 minutes carefully analyzing your words, redoing math formulas on paper, and:
- We might be stopwatching lag differently, i.e. different lag stopwatch start/stop points.
- For my lag stopwatch, there's no such thing as "1/4" except in a 4-frameslice situation.
- For my lag stopwatch, it is not a function of Emulator:Realworld scanout velocity difference.

As long as the input read is done anywhere later in the refresh cycle, and the emulator scanout velocity is faster, there's a lag reduction versus the original machine under the specific lag stopwatch criteria:

Quote:
Originally Posted by ReadOnlyCat
In order to claim that this is a benefit, you must assume that most inputs occur in the top 1/4th of the screen of an Amiga frame, which is frankly dubious given that inputs are equally likely to occur at any moment during a frame.
Under my lag stopwatch criteria for this case, there's no such thing as "1/4", nor "first 1/4", nor "after first 1/4". The biggest difference occurs near the bottom edge of the screen: it is proportional to the scan line number, regardless of scanout velocity. So the bigger emu-ahead-of-real difference occurs the closer the input read is to the bottom edge.

Setting specific math variables to avoid confusion:
-- Stopwatch start for this one is the beginning of the visual refresh cycle (the moment scanline #1 begins)
-- Stopwatch stop for this one is the input read & visual action
-- Original 8-bit machines can react during (or below) the scanline where the input read occurred, e.g. doing the input read in the scanline right above the pinball flipper sprites
-- Real time input reads mid-scanout, with real raster racing tightly behind emulator raster
-- For math simplicity, temporarily ignore display lag (instant GtG) and frameslice granularity (e.g. theoretical 1-scanline-tall frameslices or frontbuffer beam racing)

For a realtime input read at the exact middle of an emulator refresh cycle (1/2 of 1/60sec):
-- Original on 60Hz has already displayed half of the refresh cycle to human eyes in 1/120sec (halftime of 60Hz scan)
-- Emulator on 240Hz has already displayed half of the refresh cycle to human eyes in 1/480sec (halftime of 240Hz scan)
-- Assuming (A) beam racing, and (B) real time input reads during beam racing
-- Your lag savings is the mathematical difference between 1/480sec and 1/120sec: 3/480sec less lag than the original machine under this lag stopwatching criteria (worked out below).
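
Worked out explicitly:
Code:
half of a 60 Hz scanout:   (1/60)/2  = 1/120 sec = ~8.33 ms
half of a 240 Hz scanout:  (1/240)/2 = 1/480 sec = ~2.08 ms
lag savings:               1/120 - 1/480 = 3/480 sec = 6.25 ms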

Our confusion may simply be different lag stopwatching criteria (start/end), and there are many ways to choose them (audio stimuli, visual stimuli, lag differential of concurrently executing machines, etc). If you take a theoretical high speed video of two machines running exactly the same software at exactly the same time (1 real device, 1 emulator), with the same refresh-begin intervals, then the screen reaction to an input read occurs much sooner on the emulated machine for bottom-edge scanout. If you were thinking of a different lag stopwatching criteria, then you may very well be correct, and our confusion is resolved.

Indeed, because emulator audio is necessarily buffered during fast-scanout, lag-stopwatching via audio necessarily differs from visual stimuli. I am a deafie, so I'm a visual-stimuli guy myself... And even isolating to visuals, there's more than one legitimate way to lag-stopwatch visual stimuli, such as measuring from the scanline of the screen reaction rather than from scanline #1. The screen reaction (e.g. player action) can occur in a vertically different part of the screen than the trigger (e.g. the danger/obstacle you react to), so it can benefit or handicap depending on whether one is above or below the other. Some lag stopwatching criteria only show lag advantages in certain games, while others show lag advantages in different games or apps (e.g. drawing programs -- drawing lag). There are specific cases where a faster scanout may handicap, e.g. giving you less time to react, but that's not universal; it depends on the layout of onscreen activity and input reads. Unfortunately, no single lag-stopwatching measurement method fits all.

Have I explained this better? Or have I made an error? (Unlikely.)... I should have posted a diagram instead of writing so many words, eh? Some old Blur Busters scan diagrams show 144fps at 144Hz GSYNC versus 100fps at 144Hz GSYNC, demonstrating the behaviour of two different frame rates on the same 144 Hz GSYNC monitor. Upon request, I can create new diagrams demonstrating the lag savings if preferred. People sometimes like my diagrams and TestUFO animations more than my words!

Quote:
Originally Posted by ReadOnlyCat
If that system works, it is by definition not realistic in any way since the original machine did not behave like that.
Indeed unnatural, but regardless, it's lower lag than any other possible emulator lag-reducing method (input delay, hard GPU sync, VRR-only benefit, etc). Many tricks are combinable -- beam racing VRR refresh cycles apparently works!

Quote:
Originally Posted by ReadOnlyCat
(*) in the analog domain, there is some evidence that they can at least detect (as opposed to respond to) very fine lag (millisecond) but this is likely an indirect detection made possible by the easier predictability of the response to analog movements.
<offtopic but interesting>
Quite possibly indeed! Another factor: even when a human cannot feel the millisecond, there's also the "cross-the-finish-line" effect: the "see each other, react at the same time, frag each other at the same time" situation. With near-identical human reaction times, the input lag of the equipment can be the deciding factor in a specific reaction-time win. The "my reaction time seems better with this lower-lag setup" effect is powerful even if one can't always feel the millisecond or few directly. When competing in professional leagues, the reaction time spread is tighter, so tiny lag differences matter more. This affects statistical wins in their favour.
</offtopic but interesting>

Last edited by mdrejhon; 10 April 2018 at 20:43.
Old 14 April 2018, 10:18   #215
mark_k
Registered User
 
Join Date: Aug 2004
Location:
Posts: 3,333
Does WinUAE's current lagless vsync code require D3DKMTGetScanLine()? According to MSDN that only works on Windows Vista and later. For Direct3D 9, could GetRasterStatus() be used instead? That could allow lagless vsync to be XP-compatible.
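
(For illustration, the Direct3D 9 polling is a single call; a minimal sketch, where d3d9Device and the result variables are placeholders:)
Code:
// XP-compatible raster polling via IDirect3DDevice9::GetRasterStatus().
D3DRASTER_STATUS rs;
if (SUCCEEDED(d3d9Device->GetRasterStatus(0, &rs)))
{
    // rs.InVBlank is TRUE inside vertical blanking; otherwise
    // rs.ScanLine is the current scan line (0 = first visible line).
    inVBlank    = rs.InVBlank != FALSE;
    currentLine = rs.ScanLine;
}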


Just for completeness and in case any other emulator devs are reading this thread, it could be possible to support lagless vsync in DirectDraw too, via GetVerticalBlankStatus() and GetScanLine().

GetScanLine() could be useful since the number it returns includes lines in the vertical blanking interval. MSDN says "The returned scan line value is in the range from 0 through n, where 0 is the first visible scan line on the screen and n is the last visible scan line, plus any scan lines that occur during the vertical blank period. So, in a case where an application is running at a resolution of 640×480 and there are 12 scan lines during vblank, the values returned by this method range from 0 through 491."
Old 14 April 2018, 11:33   #216
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,502
It isn't that simple. QueryDisplayConfig() is Vista+ only and it is needed to find out the horizontal scan rate and total line count. Without it, only 1:1 modes can work (no GPU scaling).
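
(For reference, a minimal sketch of that query; the signal-info fields are what supply the ActiveHeight/TotalHeight/HFreq values seen in the log earlier:)
Code:
// Query the actual video signal timing (Vista+).
UINT32 numPaths = 0, numModes = 0;
GetDisplayConfigBufferSizes(QDC_ONLY_ACTIVE_PATHS, &numPaths, &numModes);

std::vector<DISPLAYCONFIG_PATH_INFO> paths(numPaths);
std::vector<DISPLAYCONFIG_MODE_INFO> modes(numModes);
QueryDisplayConfig(QDC_ONLY_ACTIVE_PATHS, &numPaths, paths.data(),
                   &numModes, modes.data(), nullptr);

for (const auto& m : modes)
    if (m.infoType == DISPLAYCONFIG_MODE_INFO_TYPE_TARGET)
    {
        const DISPLAYCONFIG_VIDEO_SIGNAL_INFO& sig = m.targetMode.targetVideoSignalInfo;
        // sig.activeSize.cy = visible scan lines, sig.totalSize.cy = total lines,
        // sig.hSyncFreq / sig.vSyncFreq = horizontal/vertical rates as rationals.
    }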
Old 15 April 2018, 15:04   #217
mark_k
Registered User
 
Join Date: Aug 2004
Location:
Posts: 3,333
I wonder if some kind of calibration could be feasible, like the old low-latency vsync calibration? Loop calling GetRasterStatus() until the returned scanline# wraps back to 0, to determine the number of active lines. Use QueryPerformanceCounter() or RDTSC to time the active period (divide by the number of active lines to get the scan rate). Time how long the returned scanline value remains 0 to determine how long vertical blanking is (and hence calculate the number of scanlines in vblank).
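
(A sketch of that calibration loop; illustrative only, with no error handling:)
Code:
// Time the active scanout period with QueryPerformanceCounter while polling
// GetRasterStatus(), then derive the approximate per-scanline time.
LARGE_INTEGER freq, tTop, tBottom;
QueryPerformanceFrequency(&freq);

D3DRASTER_STATUS rs = {};
// Wait until the raster wraps back to line 0 (start of active scanout).
do { d3d9Device->GetRasterStatus(0, &rs); } while (rs.ScanLine != 0 || rs.InVBlank);
QueryPerformanceCounter(&tTop);

// Poll until vblank begins, remembering the highest scanline seen.
UINT activeLines = 0;
do {
    d3d9Device->GetRasterStatus(0, &rs);
    if (rs.ScanLine > activeLines) activeLines = rs.ScanLine;
} while (!rs.InVBlank);
QueryPerformanceCounter(&tBottom);

double activeSecs = double(tBottom.QuadPart - tTop.QuadPart) / double(freq.QuadPart);
double lineTime   = activeSecs / activeLines; // approximate time per scanline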

How often does WinUAE currently call D3DKMTGetScanLine()? Is there much overhead in calling that routine? Maybe the overhead is driver-dependent, and some hardware configs could work better being manually timed?
Old 15 April 2018, 15:07   #218
mark_k
Registered User
 
Join Date: Aug 2004
Location:
Posts: 3,333
Talking again about DirectDraw, that could be a good way to get lagless vsync working on low-end/older systems running 98/ME/XP (or Vista/7 with DWM disabled). Maybe the RetroArch guys could investigate that, since they already back-ported to Windows 98. I think you can write directly to the framebuffer, so an emulator could either do that (writing one slice at a time), or render a slice then stretch-blit that small region to the framebuffer. Little or no memory bandwidth wasted.
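
(For illustration, the DirectDraw calls involved; a sketch only, with error handling omitted:)
Code:
// Poll the raster, then write one slice directly into the primary surface.
DWORD scanline = 0;
ddraw7->GetScanLine(&scanline); // current raster line (includes vblank range)

DDSURFACEDESC2 desc = {};
desc.dwSize = sizeof(desc);
if (SUCCEEDED(primary->Lock(NULL, &desc, DDLOCK_WAIT | DDLOCK_WRITEONLY, NULL)))
{
    // Write the slice's pixels via desc.lpSurface, one row per desc.lPitch bytes.
    primary->Unlock(NULL);
}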
Old 15 April 2018, 19:24   #219
rare_j
Zone Friend
 
Join Date: Apr 2005
Location: London
Posts: 1,176
As someone who uses WinUAE on Windows XP, I would not expect to see lagless vsync as an option. I'm grateful enough that WinUAE runs at all on XP.
Anyway, it's doubtful a Windows XP system would have a high enough spec to do lagless vsync properly.
Old 16 April 2018, 04:48   #220
mdrejhon
Chief Blur Buster
 
Join Date: Mar 2013
Location: Toronto, Canada
Posts: 40
Quote:
Originally Posted by mark_k
Talking again about DirectDraw, that could be a good way to get lagless vsync working on low-end/older systems running 98/ME/XP (or Vista/7 with DWM disabled).
I have come up with a way to accurately predict the scan line number (to an accuracy of ~0.1 - 0.2%) without GetScanLine.

I only need to know the timestamps of VSYNC events, and thus I can beam-race any platform that supports tearlines (Linux, Mac, Android, PC) as long as I have access to a high-precision counter such as QPC, RDTSC or std::chrono::high_resolution_clock (et cetera). Though a hook may be needed to access precision counters (e.g. RDTSC), lagless VSYNC can optionally be brought to older-spec Windows systems with sufficient frameslice throughput (240 frameslices per second at 4 frameslices per refresh cycle).

Even a 10%-20% error in scanline prediction is still good enough for 10-frameslice beam racing with a 2-frameslice trailbehind (2/10ths of a refresh cycle of extra jitter margin). Knowing the VBI size helps and allows you to tighten the beamrace margin, but it is only cake frosting, as I easily get to less than 2-3% error without knowing the VBI size. VBI knowledge simply improves the raster guess to ~0.1-0.2% with VSYNC timestamps.

I've gotten my cross-platform beam racing demo far enough along to run on 2 separate platforms with exactly the same code. I think this is probably a world's first. Cross platform beam racing is a reality with my code!

Zero need for a direct ScanLine register if I have access to 90% of VBI timestamps (I dejitter timestamp inaccuracies & extrapolate missed VBIs). I provide a simulated .ScanLine from my RasterCalculator class, and it keeps incrementing into the VBI too (so it raster-guesses the scan line number in the VBI too!).
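
(A hypothetical sketch of that kind of estimate -- not the actual RasterCalculator code -- deriving a scan line number purely from dejittered VSYNC timestamps:)
Code:
#include <cmath>

struct RasterEstimate {
    double refreshPeriod;  // seconds, averaged over recent VSYNC intervals
    double lastVsyncTime;  // seconds, timestamp of the most recent VSYNC
    int    totalScanlines; // active + VBI lines (estimated or queried)

    // Estimated scan line at time 'now'; keeps counting through the VBI.
    int scanLine(double now) const {
        double elapsed = std::fmod(now - lastVsyncTime, refreshPeriod);
        return (int)(elapsed / refreshPeriod * totalScanlines);
    }
};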

You *can* provide an optional platform-specific hardware ScanLine hook into my module, but it's completely optional. 10-frameslice beam racing (like WinUAE) only needs a ~10-20% error margin, and I'm getting all the way to 0.1% error with completely software-based raster guesses, as simple precision offsets from VSYNC timestamps.

I'm using C# .NET + MonoGame for this cross-platform beam racing demo. Got it working on Mac and on Windows so far. MonoGame uses OpenGL on Mac and Direct3D on Windows. But the API doesn't matter; you can use whatever framework you want. It simply must support tearlines.

As long as it has tearlines, it is beamraceable, and thusly "lagless VSYNC" is possible.
Repeat after me, "VSYNC OFF tearlines are just rasters!"

Yes, my cross-platform beam racing demo is going to be open source. My cross-platform RasterCalculator & VsyncCalculator classes will be the useful modules for cross-platform "Lagless VSYNC" for other emulator authors (once ported to C/C++).

Stay tuned.

Last edited by mdrejhon; 17 April 2018 at 18:24.