New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add lapack driver argument to randomized svd function #20617
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add an entry to the change log at doc/whats_new/v*.rst
. Like the other entries there, please reference this pull request with :pr:
and credit yourself (and other contributors if applicable) with :user:
.
You need to add an additional test to check that we can indeed pass the new parameter as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. Could you please add a small unit test for this option?
I added the test however left the quotation marks as single rather than double around variable names as suggested above in order to match the rest of this function/module. |
I would expect You just probably want to apply black on the two files and that's it. |
@skailasa would you be able to complete your PR? |
Hi, yes sorry about dropping the ball on this. Been held up elsewhere/been on my holidays. WIll complete this today! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is indeed nice enhancement.
sklearn/utils/extmath.py
Outdated
Whether to use the more efficient divide-and-conquer approach | ||
(`"gesdd"`) or more general rectangular approach (`"gesvd"`) to compute | ||
the SVD of the matrix B, which is the projection of M into a low | ||
dimensional subspace, as described by Halko et. al. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as described by Halko et. al.
Can you please add a reference to the mention article?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a reference to the function's docstring, as it's mentioned multiple times during the parameter documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I just checked and a reference is already in the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. In that case, can you adapt the doc-string to add a Sphinx cross-reference similarly to this?
scikit-learn/sklearn/cluster/_optics.py
Lines 203 to 211 in 16419d1
References | |
---------- | |
.. [1] Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, | |
and Jörg Sander. "OPTICS: ordering points to identify the clustering | |
structure." ACM SIGMOD Record 28, no. 2 (1999): 49-60. | |
.. [2] Schubert, Erich, Michael Gertz. | |
"Improving the Cluster Structure Extracted from OPTICS Plots." Proc. of | |
the Conference "Lernen, Wissen, Daten, Analysen" (LWDA) (2018): 318-329. |
sklearn/utils/tests/test_extmath.py
Outdated
def test_randomized_svd_lapack_driver(): | ||
# Check that different SVD drivers provide consistent results | ||
|
||
# Matrix being compressed | ||
n = 123 | ||
m = 456 | ||
k = 10 | ||
rng = np.random.RandomState(0) | ||
X = rng.rand(n, m) | ||
|
||
# Number of components | ||
k = 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about parametrizing this test on a few combinations of n
, m
, k
and on the random seed?
For instance:
def test_randomized_svd_lapack_driver(): | |
# Check that different SVD drivers provide consistent results | |
# Matrix being compressed | |
n = 123 | |
m = 456 | |
k = 10 | |
rng = np.random.RandomState(0) | |
X = rng.rand(n, m) | |
# Number of components | |
k = 10 | |
@pytest.mark.parametrize("n", [50, 100, 300]) | |
@pytest.mark.parametrize("m", [50, 100, 300]) | |
@pytest.mark.parametrize("k", [10, 20, 50]) | |
@pytest.mark.parametrize("seed", range(5)) | |
def test_randomized_svd_lapack_driver(n, m, k, seed): | |
# Check that different SVD drivers provide consistent results | |
# Matrix being compressed | |
rng = np.random.RandomState(seed) | |
X = rng.rand(n, m) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I adapted my suggestion which was shadowing the parametrized value.
(It looks like you have not committed all the changes or pushed them.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, after removing the last shadowing.
sklearn/utils/tests/test_extmath.py
Outdated
X = rng.rand(n, m) | ||
|
||
# Number of components | ||
k = 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this originates from the mistake I made when suggesting and this has to be removed.
k = 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR @skailasa !
dimensional subspace, as described in [1]_. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dimensional subspace, as described in [1]_. | |
dimensional subspace, as described in [1]_. | |
.. versionadded:: 1.1 | |
doc/whats_new/v1.0.rst
Outdated
@@ -913,6 +913,10 @@ Changelog | |||
these functions were not documented and part from the public API. | |||
:pr:`20521` by :user:`Olivier Grisel <ogrisel>`. | |||
|
|||
- |Enhancement| :func:`utils.extmath.randomized_svd` now accepts an argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to move into 1.1
sklearn/utils/extmath.py
Outdated
@@ -332,6 +333,12 @@ def randomized_svd( | |||
the value of `random_state` explicitly to suppress the deprecation | |||
warning. | |||
|
|||
lapack_svd_driver : {"gesdd", "gesvd"}, default="gesdd" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Do you think svd_lapack_driver
is better here? It keeps the term "lapack_driver" together, which makes it a bit easier to search for when looking for "lapack_driver" in the scipy docs.
@skailasa you already have two approvals. Do you mind fixing conflicts and the last comment? Then this can probably be merged. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I just applied the suggestions of @thomasjpfan.
I will be merging when the CI turns green.
Thanks @skailasa |
…rn#20617) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
…rn#20617) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Reference Issues/PRs
What does this implement/fix? Explain your changes.
This PR adds the parameter to specify the Lapack driver for the internal SVD taken over the matrix B, which is the projection of the matrix being compressed in a low-dimensional subspace (see Halko et al).
This is useful in numerical applications where the defualt 'gesdd' does not offer enough precision.
Any other comments?
This function should also support double precision. However, this should be the subject of a further PR.