MNT initialize weights when using ARPACK solver with a utility #18302
Conversation
@jeremiedbb who can review this PR?
@gauravkdesai thanks for your pull request. I can't see why codecov is failing, I think this is unrelated to your PR.
@cmarmo those are just spaces removed at the end of lines by my post-commit hook. If I change the PR setting to "ignore whitespace changes" then I don't see any text change in the 0.21 what's new. Let me know if you still want me to take care of those trailing-whitespace diffs.
@gauravkdesai, modifying files unrelated to the PR is sometimes confusing for the review and for future reference.
Thanks @cmarmo, I didn't know it was that simple, I have now restored the 0.21 what's new file.
Thanks @gauravkdesai. In the meanwhile, conflicts appeared...
@cmarmo, @thomasjpfan any idea how to get this through the second review and merged?
@gauravkdesai version 0.24 should be prepared and people are quite busy with some hard PRs planned for 0.24. You have already shown a lot of patience (thanks!), so I'm milestoning this PR for 0.24, knowing that it is not a blocker though.
@cmarmo thanks
LGTM. I pushed a couple of small fixes. Waiting for the CIs to be green. (Pending the resolution of the next comment.)
Arff, I have second thoughts on the PR itself. Indeed, we make the two following changes:
So the new changes make the code:
So IMO, I would make sure of how we call it. Because I disagree with the original posting, I will just delay the merging to have some second thoughts on the matter. The PR does not need to be closed, since all the changes that I mentioned can take place here. @gauravkdesai, I am sorry to put up barriers at the last minute since you did a perfect job on what was requested initially.
However, I am not sure about the change in PLSSVD. Please wait for my review before considering merging this PR.
If we want to keep this PR for 0.24, let's focus it on the original issue and extract the change on PLSSVD into a separate PR that would need significantly more work, as explained below:
```python
    U, s, Vt = svd(C, full_matrices=False)
else:
    v0 = _init_arpack_v0(n_min_dim, self.random_state)
    U, s, Vt = svds(C, k=n_components, v0=v0)
```
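For context, here is a small standalone sketch (not from the PR; the matrix and seeds are illustrative) of what pinning `v0` buys: two `svds` runs with the same starting vector agree, and the leading singular values match the full LAPACK SVD.

```python
import numpy as np
from scipy.linalg import svd
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
C = rng.standard_normal((20, 10))

# Starting vector sampled uniformly in [-1, 1], as ARPACK does internally;
# for svds its length must be min(C.shape).
v0 = np.random.RandomState(42).uniform(-1, 1, min(C.shape))

# Two runs with the same v0 give reproducible results.
U1, s1, _ = svds(C, k=3, v0=v0)
U2, s2, _ = svds(C, k=3, v0=v0)
assert np.allclose(s1, s2)

# The top-3 singular values agree with the full LAPACK SVD.
s_full = svd(C, compute_uv=False)               # descending order
assert np.allclose(np.sort(s1), np.sort(s_full)[-3:])  # svds returns ascending
```

Note that `svds` returns singular values in ascending order while `scipy.linalg.svd` returns them descending, hence the sorting before the comparison.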
**ogrisel** (Member) commented on Nov 27, 2020:
I do not understand this change. In scikit-learn master, PLSSVD only uses the LAPACK SVD solver (from `scipy.linalg.svd`), if I understand correctly. This change makes it use ARPACK when `n_components` is small enough. While I am not opposed to this change in general, I think this is a new feature / optimization that would deserve its own pull request along with tests. It's not just the neutral internal refactoring / RNG fix that was the initial goal of this PR. I think this PR should be restricted to estimators that already use the ARPACK solver in master.
If we want to pursue this change (in a new PR), we might as well consider adding the option to use scikit-learn's own `randomized_svd` solver, which has the potential to be even faster, if possibly not as accurate.
We should also expose the choice of the SVD solver in the constructor parameters of PLSSVD, as we do for other estimators such as PCA and TruncatedSVD, to make it possible for our users to finely control the speed vs. numerical/statistical accuracy trade-off.
To preserve usability for users who don't care about those details, we can implement some "auto" mode by default, as we already do for PCA and TruncatedSVD. We should also run a small benchmark study to make sure that the chosen "auto" strategy brings the expected performance improvements while keeping results statistically accurate enough.
@gauravkdesai I will make the changes asked by @ogrisel so we can quickly merge your PR before the 0.24 branching.
@glemaitre Super, thanks mate.
LGTM, thanks all!
Merged commit da562b4 into scikit-learn:master.
I opened #18931 to not forget about the suggested change of solver for PLSSVD.
Wow, super cool. This was my first PR in open source; hopefully many more will follow.
Reference Issues/PRs
Fixes #5545
Fixes #11524
What does this implement/fix? Explain your changes.
This PR creates a function sklearn.utils.init_arpack_v0(size, random_state) whose goal is to be used each time eigs, eigsh or svds from scipy.sparse.linalg is called. It initializes the v0 parameter with values sampled from the uniform distribution on [-1, 1] (like in ARPACK) to avoid convergence issues with other initializations. Passing the v0 parameter is mandatory, as it is the only way to make the behaviour of these linalg functions deterministic.
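A minimal inline sketch of the utility described above (the merged helper lives in scikit-learn's private utils; this standalone version is for illustration only):

```python
import numpy as np
from sklearn.utils import check_random_state

def init_arpack_v0(size, random_state):
    """Sample ARPACK's starting vector uniformly in [-1, 1].

    Mirrors ARPACK's own default initialization so convergence behaviour
    is unchanged, while making runs reproducible via random_state.
    """
    random_state = check_random_state(random_state)
    return random_state.uniform(-1, 1, size)

# Same seed -> same starting vector -> deterministic eigs/eigsh/svds runs.
v0 = init_arpack_v0(10, random_state=42)
```

The returned array is then passed as the `v0` argument of `eigs`, `eigsh`, or `svds`.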
Any other comments?