
[MRG after #12145] Add "Randomized SVD" solver option to KernelPCA for faster partial decompositions, like in PCA #12069

Merged
merged 317 commits into scikit-learn:main from smarie:randomized_svd_kPCA on Apr 27, 2021

Conversation

@smarie
Contributor

@smarie smarie commented Sep 13, 2018

Fixes #12068

@smarie smarie changed the title Randomized svd k pca Add "Randomized SVD" solver option to KernelPCA for faster partial decompositions, like in PCA Sep 13, 2018
@smarie
Contributor Author

@smarie smarie commented Sep 14, 2018

It seems that the tests fail on some Python versions because of AttributeError: 'module' object has no attribute 'matmul'. This seems to be because the version of numpy used in this environment predates matmul (which was only added in numpy 1.10).

What is the reference way to perform matrix multiplication in sklearn? I will try 'dot'.
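For reference, a minimal illustration of the portable alternative: `ndarray.dot` works on all numpy versions, while `np.matmul` only appeared in numpy 1.10.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

# A.dot(B) is equivalent to np.matmul(A, B) for 2-D arrays,
# but also works on numpy versions older than 1.10
C = A.dot(B)  # C == [[19, 22], [43, 50]]
```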

@smarie
Contributor Author

@smarie smarie commented Sep 14, 2018

OK, now everything seems fine. Waiting for review (feel free to react in #12068 too).

Member

@jnothman jnothman left a comment

This seems reasonable, but needs a test to show near-equivalence.

sklearn/decomposition/kernel_pca.py (review thread: outdated, resolved)
sklearn/decomposition/kernel_pca.py (review thread: outdated, resolved)
@smarie
Contributor Author

@smarie smarie commented Sep 17, 2018

Concerning the additional equivalence test (or modifying an existing test to explicitly add the option): agreed. I am still not able to run sklearn tests locally because of compilation issues on mingw32/windows, but I'll try.

@smarie
Contributor Author

@smarie smarie commented Sep 17, 2018

I just pushed 'blindly' (no local run!) what I think is right to update in the tests, at least to share the intent. I also added docstrings to all tests so that we can document the intent of each; I guess that's easier to maintain in the long run.

@smarie
Contributor Author

@smarie smarie commented Sep 17, 2018

I managed to make the modified tests run locally without recompiling everything, and fixed them. Please check; it seems sufficient now.

Member

@ogrisel ogrisel left a comment

As far as I can see, you did not add a test that checks that the results of the randomized solver are comparable to those obtained with the "dense" or "arpack" solvers. Is that right? I am not sure how to best test this, but I think we need a basic sanity equivalence test here.

Also could you please do a quick benchmark on a dataset of your choice to highlight the computational impact of the new solver? No need to include the benchmark code in the PR itself, just report it with the results as a comment to this PR discussion.

Also have you had a look at the plots that use KernelPCA in the example to ensure that you get qualitatively the same results? Please include copy of the resulting images in the PR discussion as well.

sklearn/decomposition/kernel_pca.py (review thread: outdated, resolved)
sklearn/decomposition/kernel_pca.py (review thread: outdated, resolved)
sklearn/decomposition/tests/test_kernel_pca.py (review thread: outdated, resolved)
Member

@jnothman jnothman left a comment

Some small documentation comments to add to @ogrisel's.

sklearn/decomposition/kernel_pca.py (review thread: outdated, resolved)
For eigen_solver == 'arpack', refer to `scipy.sparse.linalg.svds`.
For eigen_solver == 'randomized', see:
`Finding structure with randomness: Stochastic algorithms
Member

@jnothman jnothman Sep 20, 2018

I don't think this use of backticks is appropriate in Restructured Text

Contributor Author

@smarie smarie Sep 21, 2018

Sorry, I am a newbie in rst / sphinx; I use markdown + mkdocs in my own open source projects. Here I simply copied the style of what I saw in PCA. Should I change it? If so, we should change the PCA doc too.

@smarie
Contributor Author

@smarie smarie commented Sep 21, 2018

Thanks @ogrisel for the detailed review. Concerning the tests, I have added the new solver option to all pre-existing tests. As you can see, one of them compares the kPCA + linear kernel results with PCA. So if randomized_svd is well implemented in PCA, this test ensures that it is in kPCA, too.

But I agree that this was not very satisfying, so I created a new test (not pushed yet). In this test I compare both the results and the execution times of the 'dense', 'arpack' and 'randomized' methods on a randomly generated dataset, so as to assert both the time performance gain and the accuracy.

With 5000 random samples and n_components = 50, it will hopefully show the performance gain (currently I have the following running times: dense: 13s, arpack: 19s, randomized: 1s). I will try to come up with some plots to see how the execution times evolve with the number of components.

While doing my experiments I discovered that I need to add a safety protection against negative and imaginary eigenvalues for this method. Typically, if I find an imaginary eigenvalue I will raise an error, and if I find a negative eigenvalue I will just display a warning.

Which error type and warning mechanism should I use so as to match scikit-learn best practices?

@smarie
Contributor Author

@smarie smarie commented Sep 21, 2018

From reading the rest of the source code I see that I should probably use ValueError for the exception, and a custom subclass of UserWarning for the warning. For example:

class KernelWarning(UserWarning):
    """Custom warning to capture kernel problems

    .. versionadded:: 0.21
    """
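A sketch of how such a safeguard could look (the function name, tolerance and messages are illustrative; the mechanism scikit-learn ultimately adopted may differ):

```python
import warnings
import numpy as np

class KernelWarning(UserWarning):
    """Custom warning to capture kernel problems."""

def check_psd_eigenvalues(eigenvalues, tol=1e-10):
    # Illustrative safeguard: raise on significant imaginary parts,
    # warn (and clip) on significant negative values.
    eigenvalues = np.asarray(eigenvalues)
    if np.iscomplexobj(eigenvalues):
        if np.abs(eigenvalues.imag).max() > tol:
            raise ValueError("Kernel has significant imaginary eigenvalues; "
                             "it is probably not PSD.")
        eigenvalues = eigenvalues.real
    if eigenvalues.min() < -tol:
        warnings.warn("Kernel has negative eigenvalues; clipping them to 0.",
                      KernelWarning)
    return np.clip(eigenvalues, 0.0, None)
```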

@smarie
Contributor Author

@smarie smarie commented Sep 21, 2018

(EDITED: these old benchmarks are not the best ones to see the perf gain. See this post instead.)

Here are the results of my benchmark (x axis is in log scale). Note: since I use random data, the kernel conditioning is quite bad (very small eigenvalues after a few dozen components), but that should not change the execution times comparison.
[figure: kpca_methods_bench]
[figure: kpca_methods_bench_zoom]


smarie added a commit to smarie/scikit-learn that referenced this issue Sep 21, 2018
…ized are equivalent to full but go faster (for arpack only in some cases). This test can be transformed into a benchmark to reproduce the plots in PR scikit-learn#12069, by simply switching the benchmark_mode flag to True.
@smarie
Contributor Author

@smarie smarie commented Sep 21, 2018

Note: I put the benchmark code used to produce the above plots into the test code. A flag allows developers to switch to benchmark mode. Maybe it would be better / more practical to entirely separate the benchmark from the test, and for example to put it with the 'examples' code?

@smarie
Contributor Author

@smarie smarie commented Sep 24, 2018

I created two issues that this PR also solves, so that everything is tracked precisely:

  • #12140 (numerical and conditioning)
  • #12141 (null-space inconsistency)

I will push the corresponding code today.

EDIT: I will fix both in separate PRs; that will be even cleaner. Let's wait for those two (#12143, #12145) to be resolved before coming back to this one.

@smarie smarie changed the title Add "Randomized SVD" solver option to KernelPCA for faster partial decompositions, like in PCA [On Hold waiting for #12143 and #12145] Add "Randomized SVD" solver option to KernelPCA for faster partial decompositions, like in PCA Sep 24, 2018
@smarie
Contributor Author

@smarie smarie commented Sep 25, 2018

In the meantime I did the experiment that you suggested, @ogrisel: here are the projection results and execution times of various approximations (each column corresponds to a different choice of n_components), using solver='auto', for 2000 samples,

for the legacy version:

[figure: kpca_example_approx_legacy]

and the new one (where randomized is used in 'auto' mode when n_components < 0.8 * n_samples):

[figure: kpca_example_approx_with_randomized_in_auto]

Note the execution times in parentheses.

I have a corresponding sample file ready to push in the "examples" folder, if you think that this is relevant. I have also created a sample file for the previous figures (the "true" benchmark).
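The 'auto' rule mentioned above can be summarized by a tiny sketch (the 0.8 threshold is the one quoted in this thread; the merged implementation may use additional conditions):

```python
def choose_eigen_solver(n_samples, n_components):
    # Heuristic quoted above: use the randomized solver when only a
    # fraction of the spectrum is requested, full decomposition otherwise.
    if n_components < 0.8 * n_samples:
        return "randomized"
    return "dense"
```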

@sklearn-lgtm

@sklearn-lgtm sklearn-lgtm commented Sep 26, 2018

This pull request introduces 2 alerts when merging 8d462e1 into 86931ad - view on LGTM.com

new alerts:

  • 2 for Encoding error

Comment posted by LGTM.com

@sklearn-lgtm

@sklearn-lgtm sklearn-lgtm commented Sep 26, 2018

This pull request introduces 2 alerts when merging 6026252 into 4207225 - view on LGTM.com

new alerts:

  • 2 for Encoding error

Comment posted by LGTM.com

@sklearn-lgtm

@sklearn-lgtm sklearn-lgtm commented Sep 27, 2018

This pull request introduces 2 alerts when merging f366511 into 88b49e5 - view on LGTM.com

new alerts:

  • 2 for Encoding error

Comment posted by LGTM.com

@smarie
Contributor Author

@smarie smarie commented Sep 27, 2018

All is OK now, and the LGTM alerts have been fixed. @ogrisel, @jnothman: it is now ready for a second round of review.

@NicolasHug
Member

@NicolasHug NicolasHug commented Oct 12, 2018

@smarie , thanks for the equivalence test, can you also address this comment?

Also have you had a look at the plots that use KernelPCA in the example to ensure that you get qualitatively the same results? Please include copy of the resulting images in the PR discussion as well.

Also, how would you explain that the issues in #12143 and #12145 popped up when you were adding the randomized solver? I mean, why are they popping up now and not before?

@jnothman , @ogrisel , I find the equivalence test fairly convincing. What do you think of the execution time plot? Is the small time improvement worth it?

I just want to make sure this would be a welcome addition before reviewing #12145 and this one in more depth. I've already reviewed #12143 and I think it could/should be merged regardless of this one.

@smarie
Contributor Author

@smarie smarie commented Oct 16, 2018

Thanks for the review @NicolasHug.

  • concerning the plots in the example, I did include them in a message above (note: you have to ask GitHub to load all hidden items in the middle of this conversation), and committed a dedicated example to reproduce the plots. One can visually detect no difference.

  • Very good catch concerning #12143 and #12145: indeed, they popped up when I added the randomized method. The main reason seems to be that the randomized method naturally finds the "perfect zero eigenvalues", while the arpack and dense methods most of the time find a tiny non-zero number instead (e.g. of the order of magnitude of 1e-16, if my memory is correct). So when I tried the new method on random data I quickly found the first zero eigenvalue, which made the division-by-zero bug appear. For the record, this is why I actually struggled quite a lot to find a small example reproducing the zero-division bug with the dense method :)

  • Concerning the improvement, the benchmark plot already shows 4x faster running times for a small number of components (a gain which is much smaller on the 2D example, I admit). But on real-world timeseries data, in a Matlab environment, we found that the randomized method could speed up running times more than 10 times, and sometimes (on very large datasets) it would make a decomposition computable when it would not be computable at all with the full decomposition method because of memory issues.

  • Finally, this PR opens the road to integrating the lobpcg solver too, by adding these 6 lines in _fit_transform. I did not do it yet because I wanted to be sure that it would be OK to add it in the same PR, and I am still not sure whether it brings better results (I did not make the test yet); but it makes a lot of sense because that would make KernelPCA completely consistent, in terms of solvers, with PCA and TruncatedSVD. At least that's what the author of lobpcg suggests.
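For the record, a rough sketch of what calling the lobpcg solver on a kernel matrix looks like (this uses scipy's scipy.sparse.linalg.lobpcg directly; it is not code from this PR):

```python
import numpy as np
from scipy.sparse.linalg import lobpcg

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
K = X @ X.T  # linear-kernel Gram matrix, PSD by construction

n_components = 3
# lobpcg needs an initial block of guess vectors
X0 = rng.randn(K.shape[0], n_components)
eigenvalues, eigenvectors = lobpcg(K, X0, largest=True, tol=1e-8, maxiter=200)
```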

Just for information, are we supposed to add ourselves to the authors list at the top of the file when we propose such PRs?
Thanks again for the careful reviews.

@jnothman
Member

@jnothman jnothman commented Oct 16, 2018

@smarie
Contributor Author

@smarie smarie commented Oct 16, 2018

I devised another experiment to show the benefits of this PR: I generate an increasingly large number of samples (from 100 to 10000), and perform a KernelPCA decomposition to always get the first 100 components. In practice, 100 components are enough to get a fairly good approximation of the underlying distribution for many real-world datasets; even if you have more data to provide, you should not need many more components to describe them. The execution times speak for themselves:

[benchmark figure]

EDIT: another run with arpack:

[benchmark figure, with arpack]

@NicolasHug
Member

@NicolasHug NicolasHug commented Oct 17, 2018

Thanks for the update. Dumb question: those plots are using your current PR's code, not matlab, right?

BTW you can include benchmark code (if you deem it necessary) with a GitHub gist or by using a <details> ... </details> tag.

@smarie
Contributor Author

@smarie smarie commented Oct 17, 2018

Yes they are from the PR code of course :)
Actually, since they are quite useful to understand the eigen_solver parameter, I suggested integrating the code generating these plots as new examples:

  • this example generates the approximated circles

  • this one generates the first benchmark (n_samples fixed and n_components increasing)

  • this one generates the second benchmark (n_samples increasing, and n_components fixed)

@sklearn-lgtm

@sklearn-lgtm sklearn-lgtm commented Oct 17, 2018

This pull request introduces 2 alerts when merging 92eeed8 into d4c9e84 - view on LGTM.com

new alerts:

  • 2 for Unreachable code

Comment posted by LGTM.com

@smarie
Contributor Author

@smarie smarie commented Apr 8, 2021

Sure: we use it for anomaly detection, on sensor data (timeseries).

See H. Hoffmann, “Kernel pca for novelty detection,” Pattern Recognition, vol. 40, no. 3, pp. 863–874, 2007.

There are only a few bits missing in the scikit-learn implementation for this to run out of the box; in particular, the notion of computing a "distance to normality" is not present in the current KernelPCA, nor is the notion of a threshold.

We could definitely think of a pull request about adding these features to kPCA in sklearn, so that it can be added to https://scikit-learn.org/stable/modules/outlier_detection.html#overview-of-outlier-detection-methods

In my mind the current PR was more of a "first test" that could lead to other contributions (such as the M5P PR that is pending, or the anomaly detection PR described above). However, the current PR was initiated in Sept 2018 and, well, it took a little bit of time to make it into scikit-learn. Of course this included a huge ramp-up in quality/maturity for me.

@ogrisel
Member

@ogrisel ogrisel commented Apr 8, 2021

Thanks for the feedback, this is very interesting. Maybe @albertcthomas would be interested in the outlier/novelty detection aspects of the potential follow-up PRs to improve KernelPCA.

@ogrisel
Member

@ogrisel ogrisel commented Apr 8, 2021

Another idea for a follow-up PR: add oversample_ratio=1.5 to randomized_svd and _randomized_eigsh in addition to n_oversamples and we could have:

n_oversamples_effective = max(n_oversamples, int(oversample_ratio * n_components))

oversample_ratio=1.5 should be a reasonable default value I think (based on the results of your benchmarks).
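In code, the proposed rule would be something like the following sketch (oversample_ratio is a proposed follow-up parameter, not an existing scikit-learn one; 10 is randomized_svd's default n_oversamples):

```python
def effective_oversamples(n_components, n_oversamples=10, oversample_ratio=1.5):
    # Proposed rule: use at least n_oversamples extra random vectors,
    # but scale the oversampling with the number of requested components.
    return max(n_oversamples, int(oversample_ratio * n_components))
```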

@ogrisel
Member

@ogrisel ogrisel commented Apr 8, 2021

For the second review, maybe @TomDLT would like to give this PR a look for instance?

@smarie
Contributor Author

@smarie smarie commented Apr 8, 2021

max(n_oversamples, int(oversample_ratio * n_components))

+1, this is exactly what I had in mind at the conclusion of the bench :)! I was thinking of a default value of 1, but you're right, it is better to take some margin here.

@smarie
Contributor Author

@smarie smarie commented Apr 12, 2021

For the second review, maybe @TomDLT would like to give this PR a look for instance?

@jnothman approved this PR quite recently, and not many changes have been made since. Maybe he can simply look at the diff to re-confirm his approval? Thanks in advance!

TomDLT
TomDLT approved these changes Apr 15, 2021
Member

@TomDLT TomDLT left a comment

Very nice work!
I did not go through the full discussion, but the diff looks great and well tested.
Running the benchmark myself shows some good performance improvements. Good job!

benchmarks/bench_kernel_pca_solvers_time_vs_n_samples.py (review thread: outdated, resolved)
benchmarks/bench_kernel_pca_solvers_time_vs_n_samples.py (review thread: outdated, resolved)
doc/modules/decomposition.rst (review thread: resolved)
sklearn/decomposition/_kernel_pca.py (review thread: outdated, resolved)
sklearn/decomposition/_kernel_pca.py (review thread: outdated, resolved)
sklearn/decomposition/_kernel_pca.py (review thread: resolved)
sklearn/decomposition/tests/test_kernel_pca.py (review thread: outdated, resolved)
doc/whats_new/v1.0.rst (review thread: outdated, resolved)
smarie and others added 7 commits Apr 26, 2021
Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
@smarie
Contributor Author

@smarie smarie commented Apr 26, 2021

Thanks a lot @TomDLT for taking the time for this detailed review! With the PR having gone back and forth, there were clearly a few inconsistencies left.

@ogrisel there is one comment left open for you, and there is another pending for me for later today (minor doc mod).

Sylvain MARIE added 3 commits Apr 26, 2021
…nto randomized_svd_kPCA

# Conflicts:
#	doc/whats_new/v1.0.rst
#	sklearn/decomposition/tests/test_kernel_pca.py
…t-learn into randomized_svd_kPCA

# Conflicts:
#	doc/whats_new/v1.0.rst
@smarie
Contributor Author

@smarie smarie commented Apr 26, 2021

All OK for me; one comment left open (args ordering in init).

@TomDLT
Member

@TomDLT TomDLT commented Apr 26, 2021

Tests are failing because of the merge with main (test_kernel_pca_inverse_transform has been removed in #19732 but is still present in this PR).

You can also ignore my comment on args ordering, since positional arguments will stop being allowed in version 1.0.

@smarie
Contributor Author

@smarie smarie commented Apr 27, 2021

Thanks @TomDLT !

@smarie
Contributor Author

@smarie smarie commented Apr 27, 2021

@TomDLT I implemented the improved test for kernel centering, as you suggested. @ogrisel we should now be ready for merging.

@ogrisel ogrisel merged commit bf0886b into scikit-learn:main Apr 27, 2021
29 checks passed
@ogrisel
Member

@ogrisel ogrisel commented Apr 27, 2021

Merged! Thank you very much @smarie!

@lobpcg
Contributor

@lobpcg lobpcg commented Apr 27, 2021

From an old comment:
"Finally this PR is an open road to integrate the lobpcg solver too by adding these 6 lines in _fit_transform. I did not do it yet because I wanted to be sure that it would be ok to add it in the same PR, and I am still not sure if it brings better results (did not make the test yet) ; but it makes a lot of sense because that would make KernelPCA completely consistent in terms of solvers, with the solvers in PCA and TruncatedSVD. At least that's what the author of lobpcg suggests."
@smarie - Has this been implemented in the merge?

@smarie
Contributor Author

@smarie smarie commented Jun 11, 2021

Hi @lobpcg, sorry for seeing this just now! No, this hasn't been implemented in the merge. However, following the same modifications as in this PR should provide an easy guideline.
