Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add lapack driver argument to randomized svd function #20617

Merged
merged 16 commits into from May 17, 2022

Conversation

skailasa
Copy link
Contributor

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This PR adds the parameter to specify the Lapack driver for the internal SVD taken over the matrix B, which is the projection of the matrix being compressed in a low-dimensional subspace (see Halko et al).

This is useful in numerical applications where the defualt 'gesdd' does not offer enough precision.

Any other comments?

This function should also support double precision. However, this should be the subject of a further PR.

@skailasa skailasa changed the title Add lapack driver ENH: Add lapack driver argument to randomized svd function Jul 27, 2021
Copy link
Contributor

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add an entry to the change log at doc/whats_new/v*.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

You need to add an additional test to check that we can indeed pass the new parameter as expected.

sklearn/utils/extmath.py Outdated Show resolved Hide resolved
sklearn/utils/extmath.py Outdated Show resolved Hide resolved
sklearn/utils/extmath.py Outdated Show resolved Hide resolved
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Could you please add a small unit test for this option?

sklearn/utils/extmath.py Outdated Show resolved Hide resolved
sklearn/utils/extmath.py Outdated Show resolved Hide resolved
@skailasa
Copy link
Contributor Author

skailasa commented Jul 28, 2021

I added the test however left the quotation marks as single rather than double around variable names as suggested above in order to match the rest of this function/module.

@glemaitre
Copy link
Contributor

quotation marks as single rather than double around variable names

I would expect black to fail then since the standard is double-quoted.
FYI: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=30899&view=logs&jobId=32e2e1bb-a28f-5b18-6cfc-3f01273f5609&j=32e2e1bb-a28f-5b18-6cfc-3f01273f5609&t=fc67071d-c3d4-58b8-d38e-cafc0d3c731a

You just probably want to apply black on the two files and that's it.

doc/whats_new/v1.0.rst Outdated Show resolved Hide resolved
sklearn/utils/extmath.py Outdated Show resolved Hide resolved
@glemaitre
Copy link
Contributor

@skailasa would you be able to complete your PR?

@skailasa
Copy link
Contributor Author

skailasa commented Sep 7, 2021

@skailasa would you be able to complete your PR?

Hi, yes sorry about dropping the ball on this. Been held up elsewhere/been on my holidays. WIll complete this today!

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

sklearn/utils/tests/test_extmath.py Outdated Show resolved Hide resolved
Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed nice enhancement.

Whether to use the more efficient divide-and-conquer approach
(`"gesdd"`) or more general rectangular approach (`"gesvd"`) to compute
the SVD of the matrix B, which is the projection of M into a low
dimensional subspace, as described by Halko et. al.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as described by Halko et. al.

Can you please add a reference to the mention article?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a reference to the function's docstring, as it's mentioned multiple times during the parameter documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I just checked and a reference is already in the function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. In that case, can you adapt the doc-string to add a Sphinx cross-reference similarly to this?

References
----------
.. [1] Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel,
and Jörg Sander. "OPTICS: ordering points to identify the clustering
structure." ACM SIGMOD Record 28, no. 2 (1999): 49-60.
.. [2] Schubert, Erich, Michael Gertz.
"Improving the Cluster Structure Extracted from OPTICS Plots." Proc. of
the Conference "Lernen, Wissen, Daten, Analysen" (LWDA) (2018): 318-329.

def test_randomized_svd_lapack_driver():
# Check that different SVD drivers provide consistent results

# Matrix being compressed
n = 123
m = 456
k = 10
rng = np.random.RandomState(0)
X = rng.rand(n, m)

# Number of components
k = 10
Copy link
Member

@jjerphan jjerphan Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about parametrizing this test on a few combinations of n, m, k and on the random seed?

For instance:

Suggested change
def test_randomized_svd_lapack_driver():
# Check that different SVD drivers provide consistent results
# Matrix being compressed
n = 123
m = 456
k = 10
rng = np.random.RandomState(0)
X = rng.rand(n, m)
# Number of components
k = 10
@pytest.mark.parametrize("n", [50, 100, 300])
@pytest.mark.parametrize("m", [50, 100, 300])
@pytest.mark.parametrize("k", [10, 20, 50])
@pytest.mark.parametrize("seed", range(5))
def test_randomized_svd_lapack_driver(n, m, k, seed):
# Check that different SVD drivers provide consistent results
# Matrix being compressed
rng = np.random.RandomState(seed)
X = rng.rand(n, m)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@jjerphan jjerphan Oct 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I adapted my suggestion which was shadowing the parametrized value.

(It looks like you have not committed all the changes or pushed them.)

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, after removing the last shadowing.

X = rng.rand(n, m)

# Number of components
k = 10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this originates from the mistake I made when suggesting and this has to be removed.

Suggested change
k = 10

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @skailasa !

dimensional subspace, as described in [1]_.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dimensional subspace, as described in [1]_.
dimensional subspace, as described in [1]_.
.. versionadded:: 1.1

@@ -913,6 +913,10 @@ Changelog
these functions were not documented and part from the public API.
:pr:`20521` by :user:`Olivier Grisel <ogrisel>`.

- |Enhancement| :func:`utils.extmath.randomized_svd` now accepts an argument
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to move into 1.1

@@ -332,6 +333,12 @@ def randomized_svd(
the value of `random_state` explicitly to suppress the deprecation
warning.

lapack_svd_driver : {"gesdd", "gesvd"}, default="gesdd"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Do you think svd_lapack_driver is better here? It keeps the term "lapack_driver" together, which makes it a bit easier to search for when looking for "lapack_driver" in the scipy docs.

@cmarmo
Copy link
Member

cmarmo commented May 10, 2022

@skailasa you already have two approvals. Do you mind fixing conflicts and the last comment? Then this can probably be merged. Thanks!

@glemaitre glemaitre self-assigned this May 17, 2022
Copy link
Contributor

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I just applied the suggestions of @thomasjpfan.
I will be merging when the CI turns green.

@glemaitre glemaitre merged commit 0c618b2 into scikit-learn:main May 17, 2022
28 checks passed
@glemaitre
Copy link
Contributor

Thanks @skailasa

lesteve pushed a commit to lesteve/scikit-learn that referenced this pull request May 19, 2022
…rn#20617)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
mathijs02 pushed a commit to mathijs02/scikit-learn that referenced this pull request Dec 27, 2022
…rn#20617)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants