Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MNT initialize weights when using ARPACK solver with a utility #18302

Merged
merged 67 commits into from Nov 27, 2020

Conversation

@gauravkdesai
Copy link
Contributor

@gauravkdesai gauravkdesai commented Aug 30, 2020

Reference Issues/PRs

Fixes #5545
fixes #11524

What does this implement/fix? Explain your changes.

This PR creates a function sklearn.utils.init_arpack_v0(size, random_state) which goal is to be used each time eigs, eigsh or svds from scipy.sparse.linalg is called. It initializes the v0 parameter correctly with value sampled from the uniform distribution in [-1, 1] (like in ARPACK) to avoid convergence issues with another initialization. The v0 parameter is mandatory as it is the only way to render linalg functions behaviour deterministic.

Any other comments?

@gauravkdesai gauravkdesai changed the title [WIP]Creation of a v0 ARPACK initialisation function [MRG]Creation of a v0 ARPACK initialisation function Aug 31, 2020
@gauravkdesai
Copy link
Contributor Author

@gauravkdesai gauravkdesai commented Aug 31, 2020

@jeremiedbb who can review this PR?

@cmarmo
Copy link
Member

@cmarmo cmarmo commented Aug 31, 2020

@gauravkdesai thanks for your pull request. I can't see why codecov is failing, I think this is unrelated to your PR.
I see that you (or was that in some previous commit?) fix some flake8 issues in the 0.21 what's new. Do you mind reverting those changes? They are not linked to the arpack function. Thanks!

@gauravkdesai
Copy link
Contributor Author

@gauravkdesai gauravkdesai commented Aug 31, 2020

@cmarmo those are just spaces removed at end of line by my post commit hook. If I change the PR setting to "ignore whitespace changes" then I don't see any text change in 0.21 what's new.
I had to touch "0.21 what's new" to move text for this PR from 0.21 to 0.24.

Let me know if you still want me to take care of those trailing white spaces diff.

@cmarmo
Copy link
Member

@cmarmo cmarmo commented Sep 1, 2020

@gauravkdesai , modifying files unrelated to the PR is sometimes confusing for the review and for future git blame results. But this is just a suggestion, I'm not the one deciding for the merge... ;)
FYI, revert the differences with upstream is quite straightforward, if your master is synchronized with upstream/master:

$ git checkout master doc/whats_new/v0.21.rst
@gauravkdesai
Copy link
Contributor Author

@gauravkdesai gauravkdesai commented Sep 1, 2020

@gauravkdesai , modifying files unrelated to the PR is sometimes confusing for the review and for future git blame results. But this is just a suggestion, I'm not the one deciding for the merge... ;)
FYI, revert the differences with upstream is quite straightforward, if your master is synchronized with upstream/master:

$ git checkout master doc/whats_new/v0.21.rst

Thanks @cmarmo, I didn't know it was that simple, I have now restored the 0.21 what's new file.

@cmarmo
Copy link
Member

@cmarmo cmarmo commented Sep 1, 2020

Thanks @gauravkdesai. In the meanwhile, conflicts appeared...

@gauravkdesai
Copy link
Contributor Author

@gauravkdesai gauravkdesai commented Nov 25, 2020

@cmarmo, @thomasjpfan any idea how to get this thru the seconds review and merged?

@cmarmo
Copy link
Member

@cmarmo cmarmo commented Nov 25, 2020

@gauravkdesai version 0.24 should be prepared and people are quite busy with some hard PR planned for 0.24. You already showed a lot of patience (thanks!), so I'm milestoning this PR for 0.24, knowing that it is not a blocker though.

@cmarmo cmarmo added this to the 0.24 milestone Nov 25, 2020
@gauravkdesai
Copy link
Contributor Author

@gauravkdesai gauravkdesai commented Nov 25, 2020

@cmarmo thanks 👍

@glemaitre glemaitre self-requested a review Nov 26, 2020
glemaitre added 2 commits Nov 26, 2020
Copy link
Contributor

@glemaitre glemaitre left a comment

LGTM. I pushed a couple of small fixes. Waiting for the CIs to be green. (pending depending on the resolution of the next comment)

@glemaitre
Copy link
Contributor

@glemaitre glemaitre commented Nov 26, 2020

Arff, I have a second thought on the PR itself. Indeed, we do the two following change:

  • change a rng.uniform call by a _init_arpack_v0
  • _init_arpack_v0 is checking the random state which is called in fit (or it should be if not)

So the new changes make the code:

  • slower due to several repeated checks and nested function calls (this is probably marginal);
  • not explicit because rng.uniform allows us to know which distribution and limits are used to generate the weights without looking at the documentation while _init_arpack_v0 is not explicit.

So IMO, I would make sure that we call check_random_state whenever we should and initialize the weights without an additional function.

Because I disagree with the original posting, I will just delay the merging to have some second thoughts on the matter. The PR does not need to be closed since all the changes that I mentioned can take place here.

@gauravkdesai I am sorry to put some barriers at the last minute since you did the perfect job that was requested initially.

@glemaitre glemaitre self-requested a review Nov 26, 2020
@ogrisel
Copy link
Member

@ogrisel ogrisel commented Nov 27, 2020

check_random_state should not introduce any significant overhead when the passed random_state argument is already an np.random.RandomState instance. check_random_state does not mutate the passed np.random.RandomState instance either, so there is no new problem introduced in that PR in that respect (I think).

However, I am not sure about the change in PLSSVD. Please let's wait for my review before considering merging this PR.

Copy link
Member

@ogrisel ogrisel left a comment

If we want to keep this PR for 0.24 let's focus it on the original issue and extract the change on PLSSVD to another PR that would need significantly more work as explained below:

U, s, Vt = svd(C, full_matrices=False)
else:
v0 = _init_arpack_v0(n_min_dim, self.random_state)
U, s, Vt = svds(C, k=n_components, v0=v0)

This comment has been minimized.

@ogrisel

ogrisel Nov 27, 2020
Member

I do not understand this change. In scikit-learn master PLSSVD is only using the LAPACK svd solver (from scipy.linalg.svd) if I understand correctly.

This change is making it use ARPACK when n_components is small enough. While I am not opposed to this change in general, I think this is a new feature / optim that would deserve its own pull request along with tests. It's not just the neutral internal refactoring / RNG fix that was the initial goal of this PR. I think this PR should be restricted to estimators that already use the ARPACK solver in master.

If we want to pursue this change (in a new PR) we might was well consider adding the option to use sklearn's own randomized_svd solver which has the potential to be even faster if if possibly not has fast.

We should also expose the choice of the SVD solver in the constructor parameter of PLSSVD as we do for other estimators such as PCA and TruncatedSVD to make it possible for our users to finely control the speed / numerical or statistical accuracy trade-off.

To preserve usability for users who don't care about those details, we can implement some "auto" mode by default as we already do for PCA and TruncatedSVD. We shall also run a small benchmark study to make sure that the chosen "auto" strategy brings the expected performance improvements while keeping results statistically accurate enough.

doc/whats_new/v0.24.rst Outdated Show resolved Hide resolved
doc/whats_new/v0.24.rst Outdated Show resolved Hide resolved
@glemaitre
Copy link
Contributor

@glemaitre glemaitre commented Nov 27, 2020

@gauravkdesai I will make the changes asked by @ogrisel we can quickly merge your PR for branching 0.24

@gauravkdesai
Copy link
Contributor Author

@gauravkdesai gauravkdesai commented Nov 27, 2020

@gauravkdesai I will make the changes asked by @ogrisel we can quickly merge your PR for branching 0.24

@glemaitre Super, thanks mate.

@glemaitre glemaitre changed the title [MRG]Creation of a v0 ARPACK initialisation function MNT initialize weights when using ARPACK solver with a utility Nov 27, 2020
Copy link
Member

@ogrisel ogrisel left a comment

LGTM, thanks all!

@ogrisel ogrisel merged commit da562b4 into scikit-learn:master Nov 27, 2020
26 checks passed
26 checks passed
triage
Details
Check build trigger
Details
Build wheel for cp${{ matrix.python }}-${{ matrix.platform_id }}
Details
Source distribution
Details
Upload to Anaconda
Details
LGTM analysis: C/C++ No code changes detected
Details
LGTM analysis: JavaScript No code changes detected
Details
LGTM analysis: Python No new or fixed alerts
Details
ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: doc Your tests passed on CircleCI!
Details
ci/circleci: doc artifact Link to 0/doc/_changed.html
Details
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
codecov/patch 100.00% of diff hit (target 98.18%)
Details
codecov/project 98.18% (-0.01%) compared to 697ceee
Details
scikit-learn.scikit-learn Build #20201127.15 succeeded
Details
scikit-learn.scikit-learn (Linting) Linting succeeded
Details
scikit-learn.scikit-learn (Linux py36_conda_openblas) Linux py36_conda_openblas succeeded
Details
scikit-learn.scikit-learn (Linux py36_ubuntu_atlas) Linux py36_ubuntu_atlas succeeded
Details
scikit-learn.scikit-learn (Linux pylatest_pip_openblas_pandas) Linux pylatest_pip_openblas_pandas succeeded
Details
scikit-learn.scikit-learn (Linux32 py36_ubuntu_atlas_32bit) Linux32 py36_ubuntu_atlas_32bit succeeded
Details
scikit-learn.scikit-learn (Linux_Runs pylatest_conda_mkl) Linux_Runs pylatest_conda_mkl succeeded
Details
scikit-learn.scikit-learn (Windows py36_pip_openblas_32bit) Windows py36_pip_openblas_32bit succeeded
Details
scikit-learn.scikit-learn (Windows py37_conda_mkl) Windows py37_conda_mkl succeeded
Details
scikit-learn.scikit-learn (macOS pylatest_conda_forge_mkl) macOS pylatest_conda_forge_mkl succeeded
Details
scikit-learn.scikit-learn (macOS pylatest_conda_mkl_no_openmp) macOS pylatest_conda_mkl_no_openmp succeeded
Details
@ogrisel
Copy link
Member

@ogrisel ogrisel commented Nov 27, 2020

I opened #18931 to not forget about the suggested change of solver for PLSSVD.

@gauravkdesai
Copy link
Contributor Author

@gauravkdesai gauravkdesai commented Nov 27, 2020

Wow, super cool. this was my first PR in open source, hopefully many more will follow.
Thanks @glemaitre, @ogrisel, @cmarmo and @thomasjpfan for all the help and guidance. Learnt a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

6 participants