-
Notifications
You must be signed in to change notification settings - Fork 10.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reintroduce grpclb re-resolution with fix #14438
Merged
AspirinSJL
merged 2 commits into
master
from
revert-14412-revert-14376-revert-14356-revert-13671-grpclb_reresolution
Feb 20, 2018
Merged
Reintroduce grpclb re-resolution with fix #14438
AspirinSJL
merged 2 commits into
master
from
revert-14412-revert-14376-revert-14356-revert-13671-grpclb_reresolution
Feb 20, 2018
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
AspirinSJL
requested review from
a11r,
dgquintas,
markdroth,
vjpai,
y-zeng and
yang-g
as code owners
February 15, 2018 17:43
|
vjpai
suggested changes
Feb 15, 2018
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix PR title to indicate that this is the fix PR, and update the PR description to say the changes since the last attempt. Approval for changes in test/cpp/end2end
|
AspirinSJL
changed the title
Revert "Revert "Revert "Revert "grpclb re-resolution""""
Readd grpclb re-resolution with fix
Feb 15, 2018
|
|
|
dgquintas
changed the title
Readd grpclb re-resolution with fix
Reintroduce grpclb re-resolution with fix
Feb 15, 2018
@vjpai Done. |
|
vjpai
approved these changes
Feb 15, 2018
|
|
|
markdroth
approved these changes
Feb 16, 2018
dgquintas
approved these changes
Feb 16, 2018
AspirinSJL
force-pushed
the
revert-14412-revert-14376-revert-14356-revert-13671-grpclb_reresolution
branch
from
February 16, 2018 22:29
baedcfb
to
10142e4
Compare
|
|
|
|
|
AspirinSJL
deleted the
revert-14412-revert-14376-revert-14356-revert-13671-grpclb_reresolution
branch
February 20, 2018 03:19
Merged
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Reverts #14412
The new commit fixes several issues with the old test:
UpdatesWithClientLoadReportingTest
forReresolveDeadBalancer
.BalanceLoad()
invocations, each in a separate thread but sharing the same balancer instance. During shutdown, we callNotifyDoneWithServerlists()
to notify the condition change toservserverlist_ready_
to unblockBalanceLoad()
. But resettingservserverlist_ready_
in BalanceLoad() after the notification will cause the secondBalanceLoad()
thread waiting forever. So the resetting is removed. I have runGRPC_POLL_STRATEGY=epoll1 tools/run_tests/run_tests.py -l c++ -r 'grpclb_end2end_test' -n inf -c opt -a 8
overnight (>239k runs). No new flake is introduced.===============================================
Debugging this timeout issue is like demystifying how condition variable works. Below is the reasoning about the fix, for future reference.
Basically, we should remove the resetting anyway, because it makes more sense to not reset. Once we are done with the serverlist, we are done with the serverlist.
If we remove the resetting, the timeout issue can also be fixed.
And the reason why using
notify_all()
but keeping the resetting can't fix the timeout flake is that the threads waiting for the notification still need to wake up one by one, since the first one resets the condition before releasing the mutex, the second one will wake up with the condition being false.Also, if we keep the resetting, but increase the number of
NotifyDoneWithServerlists()
which (sets condition + notifies), the flakiness will decrease but still exist. That's because of the idempotency ofNotifyDoneWithServerlists()
. No matter how manyNotifyDoneWithServerlists()
we add, it's possible that the first thread wakes up after all of them. And unfortunately, the first thread resets the condition before releasing the mutex. Even though the second thread wakes up multiple times, the condition is always false when it wakes up.