Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refresh aspired servables/versions following config update #1518

Open
wants to merge 3 commits into
base: r1.15
from

Conversation

@njhill
Copy link

@njhill njhill commented Dec 18, 2019

Currently when the configured model list is updated via a call to handleReloadConfigRequest, the request thread blocks until any newly added models become available.

Their availability however depends on the filesystem polling thread rescanning the filesystem at some periodic interval, meaning that there's an arbitrary delay before the requested changes actually take effect and the RPC returns.

This problem may not be very noticeable with the default polling interval of 1 second, but seems undesirable for longer intervals and in particular makes API-based dynamic reconfiguration incompatible with the --file_system_poll_wait_seconds=0 setting (in this case all handleReloadConfigRequest calls time-out and do not take effect).

Fixes #1519

Currently when the configured model list is updated via a call to
handleReloadConfigRequest, the request thread blocks until any newly
added models become available.

Their availability however depends on the filesystem polling thread
rescanning the filesystem at some periodic interval, meaning that
there's an arbitrary delay before the requested changes actually take
effect and the RPC returns.

This problem may not be very noticeable with the default polling
interval of 1 second, but seems undesirable for longer intervals and
in particular makes API-based dynamic reconfiguration incompatible with
the --file_system_poll_wait_seconds=0 setting (in this case all
handleReloadConfigRequest calls time-out and do not take effect).
@googlebot googlebot added the cla: yes label Dec 18, 2019
@njhill
Copy link
Author

@njhill njhill commented Dec 18, 2019

I have opened this against 1.15 since that's the version we are using, but can rebase on a different branch if needed.

Also apologies in advance for the code, I am not very familiar with C++.

@njhill njhill requested review from nrobeR and christisg and removed request for nrobeR Jan 15, 2020
Copy link
Member

@christisg christisg left a comment

Can we add unit test coverage?

@njhill
Copy link
Author

@njhill njhill commented Feb 12, 2020

Thanks @christisg, I've pushed a commit to address your logging comment. I will aim to add unit test coverage when I get a chance... it will take me a bit longer due to unfamiliarity with C++ and the codebase/test framework.

@astleychen
Copy link

@astleychen astleychen commented Feb 20, 2020

@njhill thanks for reporting this bug. It's painful and took me more than 3 hrs to figure out REAL behavior when setting --file_system_poll_wait_seconds=0 to mitigate GCS bucket class A/B operation request calls in polling. Hope we can see your fixes soon. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.