-
-
Notifications
You must be signed in to change notification settings - Fork 21.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create a precise_delay_usec function to avoid excessive CPU penalty when not intended #99656
base: master
Are you sure you want to change the base?
Conversation
46dce67
to
5dbc8ea
Compare
41b6414
to
0e20fc8
Compare
Makes sense. I'd just want to suggest ways to trying not to clutter the internal, neither the exposed APIs before we know the precise delays are needed by projects. Options (not necessarily exclusive): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also add this note at the end of the non-precise delay_usec
documentation:
[b]Note:[/b] On Windows, sleep durations of less than 1 millisecond are not supported due to OS limitations. This means that values lower than [code]1000[/code] will be treated as [code]1000[/code] and sleep for 1 millisecond.
If I understand correctly, this is a Windows only issue. The PR that introduced #99593 also wasn't fixing any issue (i.e. had no issue associated with it). So, if there's a problem on frame pacing on windows like the PR that introduced #99593 claims, that should be handled at the platform level (and have a dedicated issue attached, instead of linking to a blog post). We definitely should not bloat and over-complicate the public API because of some weird windows quirk (which should instead just be documented). |
delay_usec is also not precise on Unix. I tested it on macOS, and it was consistently off by 4-5ms. Another benefit of this PR is the precise functions ensure consistent behavior across platforms and threads. |
17e3838
to
1657a40
Compare
@Calinou |
Testing delay_usec with delay of 1000000 microseconds... Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds =============================== Testing delay_usec with delay of 850 microseconds... Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds |
Testing precise_delay_usec with delay of 1000000 microseconds... Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds Expected delay: 1000.0000 milliseconds =============================== Testing precise_delay_usec with delay of 850 microseconds... Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds Expected delay: 0.8500 milliseconds |
I really don't like that we are having PRs without proper issues, MRPs, etc. The previous changes already caused a regression, I'd really like to be able to evaluate the issue instead of YOLO-merging fixes without proper data. |
@Faless As stated, the issue can be reproduced by attempting to limit frame-rate. |
The goal of the PR is as follows:
The CPU performance regression was caused by other parts of the engine calling By splitting the |
And this is exactly why we need issues, so we can discuss a proper solution in there, instead of splitting it across 3 different PRs, each implementing half-baked solutions. |
What is your suggestion for a non "half-baked" solution for the issue? The frame-limiting logic is not accurate because of |
There's no reason to expose that function to scripting, bloating the exposed API, and confusing the end user. Also, we already handle frame delays in a dedicated function ( |
1657a40
to
ca3c7d5
Compare
@Faless |
Well, actually, the Web platform implements frame skipping and does not use But this is at least an improvement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not convinced we need a dedicated function for this, can't we implement a simple busy wait logic in add_frame_delay
using the functions already at our disposal?
// Perform coarse sleep while the remaining time exceeds the estimate. | ||
while (usec_d > estimate) { | ||
uint64_t sleep_start = get_ticks_usec(); | ||
delay_usec(1000); | ||
uint64_t sleep_end = get_ticks_usec(); | ||
|
||
// Compute observed time and limit to 1 second. | ||
double observed = sleep_end - sleep_start; | ||
usec_d -= observed; | ||
observed = MIN(observed, 1000000.0); | ||
|
||
// Update statistical estimates for mean and standard deviation. | ||
if (unlikely(count > 1000000)) { | ||
mean = (mean + observed) / 2; | ||
m2 = 0.0; | ||
count = 1; | ||
} | ||
|
||
++count; | ||
double delta = observed - mean; | ||
mean += delta / count; | ||
m2 += delta * (observed - mean); | ||
double stddev = Math::sqrt(m2 / (count - 1)); | ||
estimate = mean + stddev; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need all these logic if we need the sleep is always going to be sub 1 second?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The units are in microseconds so it is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A simple busy wait would burn more CPU cycles than necessary. You need to sleep coarsely first, and then spin the rest of the time. For the coarse sleep, you would need to change the logic based on platform.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's say frame time is 19ms/frame, and you want to limit to 30FPS (33.33ms). That means you would end up spinning 14.33ms on the main thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't the function assume some upper bound? E.g. sleep 1 or 2 ms less than what it should and busy wait only that 1 or 2 ms? Do we have any data on how better/worse that would perform?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or... given the current implementation runs based on estimates, why does the code that runs the estimates need to be platform-dependent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. On macOS, I see errors going up to several milliseconds, and this is just on one machine. Based on OS and hardware, I don't think we can assume a single upper bound unless we make it so conservative that we just end up wasting cycles spinning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Windows explicitly calls timeBeginPeriod(1), which gives a definite upper bound.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, given this is going to be hardware dependent, should we instead use the estimate logic everywhere (and just move it inside add_frame_delay
instead of creating a dedicated function in each platform)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do that, but I think it would be better on Windows to avoid the estimate tracking. Should I move the logic for Windows/Unix into one function and add an #if branch? I can imagine some platforms could have quirks with the delay. But reimplementing add_frame_delay for each platform with minor changes seems like too much code duplication, which is why the current method uses a virtual function to isolate per platform logic.
Another benefit is having a separate precise function is it could be reused in the future for any code that requires high precision.
ca3c7d5
to
8a54512
Compare
Thanks for the work on fixing the regression and improving the implementation further. From a release management perspective, so we don't rush a significant change as a hotfix for a previous regression, I would suggest we start by reverting #99178 to fix #99593, and then this PR can be rebased to redo the changes from #99178 plus the further changes here. This is mainly to restart from a known state, and maybe reduce some of the tension I perceive that #99178 was maybe merged a bit too eagerly without fully assessing its consequences (which is now being done here). That's not to say that the change isn't wanted though, it does seem to fix a significant issue. Does that sound ok? If so I'll take care of reverting #99178. |
Yes, that sounds reasonable! If you could revert that PR, I'll rebase this on top. Thanks 🙏 |
Avoids the excess CPU penalty for non-frame pacing related sleeps as mentioned here:
#99593