[go: up one dir, main page]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a precise_delay_usec function to avoid excessive CPU penalty when not intended #99656

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mrsaturnsan
Copy link
Contributor
@mrsaturnsan mrsaturnsan commented Nov 24, 2024

Avoids the excess CPU penalty for non-frame pacing related sleeps as mentioned here:

#99593

@mrsaturnsan mrsaturnsan requested review from a team as code owners November 24, 2024 23:26
@tetrapod00 tetrapod00 changed the title Create a precise_delay_usec function to avoid excessive CPU penalty w… Create a precise_delay_usec function to avoid excessive CPU penalty when not intended Nov 24, 2024
@tetrapod00 tetrapod00 added this to the 4.4 milestone Nov 24, 2024
@mrsaturnsan mrsaturnsan requested a review from a team as a code owner November 25, 2024 01:46
@mrsaturnsan mrsaturnsan force-pushed the precise_sleep branch 15 times, most recently from 41b6414 to 0e20fc8 Compare November 25, 2024 06:19
@RandomShaper
Copy link
Member

Makes sense.

I'd just want to suggest ways to trying not to clutter the internal, neither the exposed APIs before we know the precise delays are needed by projects. Options (not necessarily exclusive):
a) Not exposing the precise ones.
b) Not having an msec version of the precise one, at all.
c) Refactoring the precise one so it's only related to frame pacing. Is there any other known usage for it yet?
d) To avoid confusion, maybe the precise one could be renamed to something else. The docs can already explain the difference, but I can foresee many would just prefer the precise version just because it looks like the best choice.

doc/classes/OS.xml Outdated Show resolved Hide resolved
doc/classes/OS.xml Outdated Show resolved Hide resolved
Copy link
Member
@Calinou Calinou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add this note at the end of the non-precise delay_usec documentation:

[b]Note:[/b] On Windows, sleep durations of less than 1 millisecond are not supported due to OS limitations. This means that values lower than [code]1000[/code] will be treated as [code]1000[/code] and sleep for 1 millisecond.

@Faless
Copy link
Collaborator
Faless commented Nov 25, 2024

a) Not exposing the precise ones.

If I understand correctly, this is a Windows only issue.

The PR that introduced #99593 also wasn't fixing any issue (i.e. had no issue associated with it).

So, if there's a problem on frame pacing on windows like the PR that introduced #99593 claims, that should be handled at the platform level (and have a dedicated issue attached, instead of linking to a blog post).

We definitely should not bloat and over-complicate the public API because of some weird windows quirk (which should instead just be documented).

@mrsaturnsan
Copy link
Contributor Author
mrsaturnsan commented Nov 25, 2024

delay_usec is also not precise on Unix. I tested it on macOS, and it was consistently off by 4-5ms.
With this PR, the accuracy of precise_delay_usec on Unix is also spot on.

Another benefit of this PR is the precise functions ensure consistent behavior across platforms and threads.

@mrsaturnsan
Copy link
Contributor Author

@Calinou
Made the requested changes.

@mrsaturnsan
Copy link
Contributor Author
mrsaturnsan commented Nov 25, 2024

delay_usec accuracy on macOS:

Testing delay_usec with delay of 1000000 microseconds...

Expected delay: 1000.0000 milliseconds
Actual delay: 1005.0240 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1005.0290 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1003.4890 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1005.0500 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1005.0520 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1005.0500 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1001.2210 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1005.0490 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1005.0590 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1001.4000 milliseconds

===============================

Testing delay_usec with delay of 850 microseconds...

Expected delay: 0.8500 milliseconds
Actual delay: 1.0690 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 1.0660 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 1.0680 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 1.0660 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 1.0660 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 1.0670 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 1.0670 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 1.0660 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 1.0670 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 1.0660 milliseconds

@mrsaturnsan
Copy link
Contributor Author
mrsaturnsan commented Nov 25, 2024

precise_delay_usec accuracy on macOS:

Testing precise_delay_usec with delay of 1000000 microseconds...

Expected delay: 1000.0000 milliseconds
Actual delay: 1000.0010 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1000.0000 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1000.0000 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1000.0000 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1000.0030 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1000.0000 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1000.0000 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1000.0000 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1000.0000 milliseconds

Expected delay: 1000.0000 milliseconds
Actual delay: 1000.0000 milliseconds

===============================

Testing precise_delay_usec with delay of 850 microseconds...

Expected delay: 0.8500 milliseconds
Actual delay: 0.8500 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 0.8500 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 0.8500 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 0.8500 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 0.8500 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 0.8500 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 0.8500 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 0.8500 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 0.8500 milliseconds

Expected delay: 0.8500 milliseconds
Actual delay: 0.8500 milliseconds

@Faless
Copy link
Collaborator
Faless commented Nov 25, 2024

delay_usec is also not precise on Unix. I tested it on macOS, and it was consistently off by 4-5ms.
With this PR, the accuracy of precise_delay_usec on Unix is also spot on.

Can you also post the results? EDIT: Well, GH is playing some tricks, I see the results, still...

I really don't like that we are having PRs without proper issues, MRPs, etc.

The previous changes already caused a regression, I'd really like to be able to evaluate the issue instead of YOLO-merging fixes without proper data.

@mrsaturnsan
Copy link
Contributor Author
mrsaturnsan commented Nov 25, 2024

@Faless
The previous PR has results from @Calinou towards the bottom. The delay_usec function previously was unstable, and that was causing FPS limiting issues. delay_usec is being used for limiting FPS on all platforms that don't provide a separate renderer based limiter, not just Windows.

As stated, the issue can be reproduced by attempting to limit frame-rate.

#99178

@mrsaturnsan
Copy link
Contributor Author

The goal of the PR is as follows:

  1. Fix the CPU performance regression.
  2. Maintain an accurate delay function.

The CPU performance regression was caused by other parts of the engine calling delay_usec with small delays which made the CPU spin to maximize accuracy where not needed.

By splitting the delay_usec function into an imprecise and precise function, we can limit the CPU cost to where it is really needed instead of bloating one function that needs to predict how accurate the user wants to sleep for.

@Faless
Copy link
Collaborator
Faless commented Nov 25, 2024

The goal of the PR is as follows:

1. Fix the CPU performance regression.

2. Maintain an accurate delay function.

The CPU performance regression was caused by other parts of the engine calling delay_usec with small delays which made the CPU spin to maximize accuracy where not needed.

And this is exactly why we need issues, so we can discuss a proper solution in there, instead of splitting it across 3 different PRs, each implementing half-baked solutions.

@mrsaturnsan
Copy link
Contributor Author
mrsaturnsan commented Nov 25, 2024

What is your suggestion for a non "half-baked" solution for the issue? The frame-limiting logic is not accurate because of delay_usec.
If you would like to keep the old behavior, we would need to still create new functions for each platform to limit the frame-rate.

@Faless
Copy link
Collaborator
Faless commented Nov 25, 2024

If you would like to keep the old behavior, we would need to still create new functions for each platform to limit the frame-rate.

There's no reason to expose that function to scripting, bloating the exposed API, and confusing the end user.

Also, we already handle frame delays in a dedicated function (OS::add_frame_delay) that is OS specific, we should try to keep it contained there.

@mrsaturnsan
Copy link
Contributor Author

@Faless
I removed exposing the precise functions in the PR. All OSs currently use OS::add_frame_delay in some form so just adding an internal precise function should work with it.

@Faless
Copy link
Collaborator
Faless commented Nov 25, 2024

I removed exposing the precise functions in the PR. All OSs currently use OS::add_frame_delay in some form so just adding an internal precise function should work with it.

Well, actually, the Web platform implements frame skipping and does not use add_frame_delay so the function is useless there (except when PROXY_TO_PTHREAD_ENABLED which is currently broken so it's not used in official builds).

But this is at least an improvement.

Copy link
Collaborator
@Faless Faless left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not convinced we need a dedicated function for this, can't we implement a simple busy wait logic in add_frame_delay using the functions already at our disposal?

drivers/unix/os_unix.cpp Outdated Show resolved Hide resolved
Comment on lines +388 to +412
// Perform coarse sleep while the remaining time exceeds the estimate.
while (usec_d > estimate) {
uint64_t sleep_start = get_ticks_usec();
delay_usec(1000);
uint64_t sleep_end = get_ticks_usec();

// Compute observed time and limit to 1 second.
double observed = sleep_end - sleep_start;
usec_d -= observed;
observed = MIN(observed, 1000000.0);

// Update statistical estimates for mean and standard deviation.
if (unlikely(count > 1000000)) {
mean = (mean + observed) / 2;
m2 = 0.0;
count = 1;
}

++count;
double delta = observed - mean;
mean += delta / count;
m2 += delta * (observed - mean);
double stddev = Math::sqrt(m2 / (count - 1));
estimate = mean + stddev;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need all these logic if we need the sleep is always going to be sub 1 second?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The units are in microseconds so it is necessary.

Copy link
Contributor Author
@mrsaturnsan mrsaturnsan Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A simple busy wait would burn more CPU cycles than necessary. You need to sleep coarsely first, and then spin the rest of the time. For the coarse sleep, you would need to change the logic based on platform.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say frame time is 19ms/frame, and you want to limit to 30FPS (33.33ms). That means you would end up spinning 14.33ms on the main thread.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't the function assume some upper bound? E.g. sleep 1 or 2 ms less than what it should and busy wait only that 1 or 2 ms? Do we have any data on how better/worse that would perform?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or... given the current implementation runs based on estimates, why does the code that runs the estimates need to be platform-dependent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. On macOS, I see errors going up to several milliseconds, and this is just on one machine. Based on OS and hardware, I don't think we can assume a single upper bound unless we make it so conservative that we just end up wasting cycles spinning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Windows explicitly calls timeBeginPeriod(1), which gives a definite upper bound.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, given this is going to be hardware dependent, should we instead use the estimate logic everywhere (and just move it inside add_frame_delay instead of creating a dedicated function in each platform)?

Copy link
Contributor Author
@mrsaturnsan mrsaturnsan Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do that, but I think it would be better on Windows to avoid the estimate tracking. Should I move the logic for Windows/Unix into one function and add an #if branch? I can imagine some platforms could have quirks with the delay. But reimplementing add_frame_delay for each platform with minor changes seems like too much code duplication, which is why the current method uses a virtual function to isolate per platform logic.

Another benefit is having a separate precise function is it could be reused in the future for any code that requires high precision.

@akien-mga
Copy link
Member

Thanks for the work on fixing the regression and improving the implementation further.

From a release management perspective, so we don't rush a significant change as a hotfix for a previous regression, I would suggest we start by reverting #99178 to fix #99593, and then this PR can be rebased to redo the changes from #99178 plus the further changes here.

This is mainly to restart from a known state, and maybe reduce some of the tension I perceive that #99178 was maybe merged a bit too eagerly without fully assessing its consequences (which is now being done here). That's not to say that the change isn't wanted though, it does seem to fix a significant issue.

Does that sound ok? If so I'll take care of reverting #99178.

@mrsaturnsan
Copy link
Contributor Author

@akien-mga

Yes, that sounds reasonable! If you could revert that PR, I'll rebase this on top. Thanks 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants