Create a precise_delay_usec function to avoid excessive CPU penalty when not intended #99656

mrsaturnsan · 2024-11-24T23:26:17Z

Avoids the excess CPU penalty for non-frame pacing related sleeps as mentioned here:

RandomShaper · 2024-11-25T07:17:05Z

Makes sense.

I'd just want to suggest ways to trying not to clutter the internal, neither the exposed APIs before we know the precise delays are needed by projects. Options (not necessarily exclusive):
a) Not exposing the precise ones.
b) Not having an msec version of the precise one, at all.
c) Refactoring the precise one so it's only related to frame pacing. Is there any other known usage for it yet?
d) To avoid confusion, maybe the precise one could be renamed to something else. The docs can already explain the difference, but I can foresee many would just prefer the precise version just because it looks like the best choice.

doc/classes/OS.xml

Calinou

Please also add this note at the end of the non-precise delay_usec documentation:

[b]Note:[/b] On Windows, sleep durations of less than 1 millisecond are not supported due to OS limitations. This means that values lower than [code]1000[/code] will be treated as [code]1000[/code] and sleep for 1 millisecond.

Faless · 2024-11-25T13:22:05Z

a) Not exposing the precise ones.

If I understand correctly, this is a Windows only issue.

The PR that introduced #99593 also wasn't fixing any issue (i.e. had no issue associated with it).

So, if there's a problem on frame pacing on windows like the PR that introduced #99593 claims, that should be handled at the platform level (and have a dedicated issue attached, instead of linking to a blog post).

We definitely should not bloat and over-complicate the public API because of some weird windows quirk (which should instead just be documented).

mrsaturnsan · 2024-11-25T13:33:48Z

delay_usec is also not precise on Unix. I tested it on macOS, and it was consistently off by 4-5ms.
With this PR, the accuracy of precise_delay_usec on Unix is also spot on.

Another benefit of this PR is the precise functions ensure consistent behavior across platforms and threads.

mrsaturnsan · 2024-11-25T13:42:29Z

@Calinou
Made the requested changes.

mrsaturnsan · 2024-11-25T13:50:20Z

delay_usec accuracy on macOS:

Testing delay_usec with delay of 1000000 microseconds...

Expected delay: 1000.0000 milliseconds
Actual delay: 1005.0240 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1005.0290 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1003.4890 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1005.0500 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1005.0520 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1005.0500 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1001.2210 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1005.0490 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1005.0590 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1001.4000 milliseconds

===============================

Testing delay_usec with delay of 850 microseconds...

Expected delay: 0.8500 milliseconds
Actual delay: 1.0690 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 1.0660 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 1.0680 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 1.0660 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 1.0670 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 1.0660 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 1.0670 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 1.0660 milliseconds

mrsaturnsan · 2024-11-25T13:52:36Z

precise_delay_usec accuracy on macOS:

Testing precise_delay_usec with delay of 1000000 microseconds...

Expected delay: 1000.0000 milliseconds
Actual delay: 1000.0010 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1000.0000 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1000.0030 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1000.0000 milliseconds

===============================

Testing precise_delay_usec with delay of 850 microseconds...

Expected delay: 0.8500 milliseconds
Actual delay: 0.8500 milliseconds

Faless · 2024-11-25T14:21:58Z

delay_usec is also not precise on Unix. I tested it on macOS, and it was consistently off by 4-5ms.
With this PR, the accuracy of precise_delay_usec on Unix is also spot on.

~~Can you also post the results?~~ EDIT: Well, GH is playing some tricks, I see the results, still...

I really don't like that we are having PRs without proper issues, MRPs, etc.

The previous changes already caused a regression, I'd really like to be able to evaluate the issue instead of YOLO-merging fixes without proper data.

mrsaturnsan · 2024-11-25T14:59:24Z

@Faless
The previous PR has results from @Calinou towards the bottom. The delay_usec function previously was unstable, and that was causing FPS limiting issues. delay_usec is being used for limiting FPS on all platforms that don't provide a separate renderer based limiter, not just Windows.

As stated, the issue can be reproduced by attempting to limit frame-rate.

#99178

mrsaturnsan · 2024-11-25T15:19:06Z

The goal of the PR is as follows:

Fix the CPU performance regression.
Maintain an accurate delay function.

The CPU performance regression was caused by other parts of the engine calling delay_usec with small delays which made the CPU spin to maximize accuracy where not needed.

By splitting the delay_usec function into an imprecise and precise function, we can limit the CPU cost to where it is really needed instead of bloating one function that needs to predict how accurate the user wants to sleep for.

Faless · 2024-11-25T15:32:17Z

The goal of the PR is as follows:
1. Fix the CPU performance regression.

2. Maintain an accurate delay function.
The CPU performance regression was caused by other parts of the engine calling delay_usec with small delays which made the CPU spin to maximize accuracy where not needed.

And this is exactly why we need issues, so we can discuss a proper solution in there, instead of splitting it across 3 different PRs, each implementing half-baked solutions.

mrsaturnsan · 2024-11-25T15:35:53Z

What is your suggestion for a non "half-baked" solution for the issue? The frame-limiting logic is not accurate because of delay_usec.
If you would like to keep the old behavior, we would need to still create new functions for each platform to limit the frame-rate.

Faless · 2024-11-25T15:38:12Z

If you would like to keep the old behavior, we would need to still create new functions for each platform to limit the frame-rate.

There's no reason to expose that function to scripting, bloating the exposed API, and confusing the end user.

Also, we already handle frame delays in a dedicated function (OS::add_frame_delay) that is OS specific, we should try to keep it contained there.

mrsaturnsan · 2024-11-25T15:49:14Z

@Faless
I removed exposing the precise functions in the PR. All OSs currently use OS::add_frame_delay in some form so just adding an internal precise function should work with it.

Faless · 2024-11-25T15:55:22Z

I removed exposing the precise functions in the PR. All OSs currently use OS::add_frame_delay in some form so just adding an internal precise function should work with it.

Well, actually, the Web platform implements frame skipping and does not use add_frame_delay so the function is useless there (except when PROXY_TO_PTHREAD_ENABLED which is currently broken so it's not used in official builds).

But this is at least an improvement.

Faless

I'm still not convinced we need a dedicated function for this, can't we implement a simple busy wait logic in add_frame_delay using the functions already at our disposal?

drivers/unix/os_unix.cpp

Faless · 2024-11-25T16:00:19Z

drivers/unix/os_unix.cpp

+	// Perform coarse sleep while the remaining time exceeds the estimate.
+	while (usec_d > estimate) {
+		uint64_t sleep_start = get_ticks_usec();
+		delay_usec(1000);
+		uint64_t sleep_end = get_ticks_usec();
+
+		// Compute observed time and limit to 1 second.
+		double observed = sleep_end - sleep_start;
+		usec_d -= observed;
+		observed = MIN(observed, 1000000.0);
+
+		// Update statistical estimates for mean and standard deviation.
+		if (unlikely(count > 1000000)) {
+			mean = (mean + observed) / 2;
+			m2 = 0.0;
+			count = 1;
+		}
+
+		++count;
+		double delta = observed - mean;
+		mean += delta / count;
+		m2 += delta * (observed - mean);
+		double stddev = Math::sqrt(m2 / (count - 1));
+		estimate = mean + stddev;
+	}


Do we need all these logic if we need the sleep is always going to be sub 1 second?

The units are in microseconds so it is necessary.

A simple busy wait would burn more CPU cycles than necessary. You need to sleep coarsely first, and then spin the rest of the time. For the coarse sleep, you would need to change the logic based on platform.

Let's say frame time is 19ms/frame, and you want to limit to 30FPS (33.33ms). That means you would end up spinning 14.33ms on the main thread.

Can't the function assume some upper bound? E.g. sleep 1 or 2 ms less than what it should and busy wait only that 1 or 2 ms? Do we have any data on how better/worse that would perform?

Or... given the current implementation runs based on estimates, why does the code that runs the estimates need to be platform-dependent?

I don't think so. On macOS, I see errors going up to several milliseconds, and this is just on one machine. Based on OS and hardware, I don't think we can assume a single upper bound unless we make it so conservative that we just end up wasting cycles spinning.

Windows explicitly calls timeBeginPeriod(1), which gives a definite upper bound.

So, given this is going to be hardware dependent, should we instead use the estimate logic everywhere (and just move it inside add_frame_delay instead of creating a dedicated function in each platform)?

We could do that, but I think it would be better on Windows to avoid the estimate tracking. Should I move the logic for Windows/Unix into one function and add an #if branch? I can imagine some platforms could have quirks with the delay. But reimplementing add_frame_delay for each platform with minor changes seems like too much code duplication, which is why the current method uses a virtual function to isolate per platform logic.

Another benefit is having a separate precise function is it could be reused in the future for any code that requires high precision.

…hen not intended

core/os/spin_lock.h

akien-mga · 2024-11-25T18:38:43Z

Thanks for the work on fixing the regression and improving the implementation further.

From a release management perspective, so we don't rush a significant change as a hotfix for a previous regression, I would suggest we start by reverting #99178 to fix #99593, and then this PR can be rebased to redo the changes from #99178 plus the further changes here.

This is mainly to restart from a known state, and maybe reduce some of the tension I perceive that #99178 was maybe merged a bit too eagerly without fully assessing its consequences (which is now being done here). That's not to say that the change isn't wanted though, it does seem to fix a significant issue.

Does that sound ok? If so I'll take care of reverting #99178.

mrsaturnsan · 2024-11-25T18:42:40Z

@akien-mga

Yes, that sounds reasonable! If you could revert that PR, I'll rebase this on top. Thanks 🙏

mrsaturnsan requested review from a team as code owners November 24, 2024 23:26

mrsaturnsan mentioned this pull request Nov 24, 2024

Huge CPU consumption comparing in 4.4.dev5 compared to 4.3.stable #99593

Open

tetrapod00 changed the title ~~Create a precise_delay_usec function to avoid excessive CPU penalty w…~~ Create a precise_delay_usec function to avoid excessive CPU penalty when not intended Nov 24, 2024

tetrapod00 added bug platform:windows performance labels Nov 24, 2024

tetrapod00 added this to the 4.4 milestone Nov 24, 2024

tetrapod00 added the regression label Nov 24, 2024

mrsaturnsan force-pushed the precise_sleep branch from 46dce67 to 5dbc8ea Compare November 25, 2024 01:46

mrsaturnsan requested a review from a team as a code owner November 25, 2024 01:46

mrsaturnsan force-pushed the precise_sleep branch 15 times, most recently from 41b6414 to 0e20fc8 Compare November 25, 2024 06:19

Calinou added the topic:porting label Nov 25, 2024

Calinou reviewed Nov 25, 2024

View reviewed changes

doc/classes/OS.xml Outdated Show resolved Hide resolved

Calinou reviewed Nov 25, 2024

View reviewed changes

doc/classes/OS.xml Outdated Show resolved Hide resolved

Calinou reviewed Nov 25, 2024

View reviewed changes

mrsaturnsan force-pushed the precise_sleep branch from 17e3838 to 1657a40 Compare November 25, 2024 13:41

mrsaturnsan force-pushed the precise_sleep branch from 1657a40 to ca3c7d5 Compare November 25, 2024 15:47

Faless reviewed Nov 25, 2024

View reviewed changes

Create a precise_delay_usec function to avoid excessive CPU penalty w…

8a54512

…hen not intended

mrsaturnsan force-pushed the precise_sleep branch from ca3c7d5 to 8a54512 Compare November 25, 2024 16:12

adamscott reviewed Nov 25, 2024

View reviewed changes

core/os/spin_lock.h Show resolved Hide resolved

akien-mga mentioned this pull request Nov 25, 2024

Revert "Make delay_usec more precise on Windows to fix framepacing" #99688

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create a precise_delay_usec function to avoid excessive CPU penalty when not intended #99656

Create a precise_delay_usec function to avoid excessive CPU penalty when not intended #99656

Create a precise_delay_usec function to avoid excessive CPU penalty when not intended #99656

Are you sure you want to change the base?

Create a precise_delay_usec function to avoid excessive CPU penalty when not intended #99656

Conversation

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment