[MRG after #12145] Add "Randomized SVD" solver option to KernelPCA for faster partial decompositions, like in PCA #12069
Conversation
It seems that the tests fail on some Python versions because of the matrix multiplication syntax I used. What is the reference way to perform matrix multiplication in sklearn? I will try 'dot'.
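For context, the `@` matrix-multiplication operator (PEP 465) only exists on Python >= 3.5, which is presumably why some test environments failed; `np.dot` is the backwards-compatible spelling. A minimal illustration:

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)

# np.dot works on all supported Python versions, while the `@` operator
# requires Python >= 3.5.
C = np.dot(A, B)
assert np.array_equal(C, A @ B)  # identical result where `@` is available
print(C.shape)
```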
Ok, now everything seems fine. Waiting for review (feel free to react in #12068 also).
Concerning the additional equivalence test (or modifying an existing test to explicitly add the option): agreed. I am still not able to run sklearn tests locally because of compilation issues on mingw32/windows, but I'll try.
I just pushed 'blindly' (no local run!) what I think is right to update in the tests, at least to share the intent. I also added docstrings to all tests so that we can document the intent of each - I guess that's easier to maintain in the long run.
I managed to make the modified tests run locally without recompiling everything, and fixed them. I let you check - it seems sufficient now.
As far as I see you did not add a test that checks that the results of the randomized solver are comparable to the one obtained with the "dense" or "arpack" solvers. Is that right? I am not sure how to best test this but I think we need a basic sanity equivalence test here.
Also could you please do a quick benchmark on a dataset of your choice to highlight the computational impact of the new solver? No need to include the benchmark code in the PR itself, just report it with the results as a comment to this PR discussion.
Also have you had a look at the plots that use KernelPCA in the example to ensure that you get qualitatively the same results? Please include copy of the resulting images in the PR discussion as well.
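A minimal sketch of the kind of sanity equivalence test requested above, written with numpy/scipy directly rather than `KernelPCA` itself (the 'randomized' option is what this PR adds, so the sketch compares the dense and ARPACK solvers on a toy centered RBF kernel; kernel width and tolerances are assumptions):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

# The leading eigenvalues of a centered kernel matrix should agree across
# a dense solver (np.linalg.eigh) and an iterative ARPACK solver (eigsh).
rng = np.random.RandomState(0)
X = rng.randn(100, 5)

sq = (X ** 2).sum(axis=1)
K = np.exp(-0.1 * (sq[:, None] + sq[None, :] - 2 * X @ X.T))  # RBF kernel

n = K.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
Kc = H @ K @ H  # double-centered kernel, as in kernel PCA

top_dense = np.sort(np.linalg.eigh(Kc)[0])[::-1][:5]
top_arpack = np.sort(eigsh(Kc, k=5, which="LA")[0])[::-1]

assert np.allclose(top_dense, top_arpack, rtol=1e-6)
```

The same pattern extends to comparing transformed coordinates, up to a per-component sign flip.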
sklearn/decomposition/kernel_pca.py
Outdated
For eigen_solver == 'arpack', refer to `scipy.sparse.linalg.svds`.
For eigen_solver == 'randomized', see:
`Finding structure with randomness: Stochastic algorithms
I don't think this use of backticks is appropriate in Restructured Text
Sorry, I am a newbie in rst/sphinx - I use markdown+mkdocs in my own open source projects. Here I simply copied the style of what I saw in PCA. Should I change it? If so, we should change the PCA doc too.
Thanks @ogrisel for the detailed review. Concerning the tests, I have added the method in all pre-existing tests. As you can see, one of them compares the kPCA+linear kernel results with PCA, so if randomized_svd is well implemented in PCA, this test ensures that it is in kPCA too. But I agree that this was not very satisfying, so I created a new test (not pushed yet). In this test I compare both the results and the execution times of the 'dense', 'arpack' and 'randomized' methods on a randomly generated dataset, so as to assert both the time performance gain and the accuracy. With 5000 random samples and n_components = 50, it will hopefully show the performance gain (currently I have the following running times: dense: 13s, arpack: 19s, randomized: 1s). I will try to come up with some plots showing how the execution times evolve with the number of components. While doing my experiments I discovered that I need to add safety protection against negative and imaginary eigenvalues for this method. Typically, if I find an imaginary eigenvalue I will raise an error, and if I find a negative eigenvalue I will just display a warning. Which error type and warning mechanism should I use to match scikit-learn best practices?
From reading the rest of the source code I see that I should probably use
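The protection described above could be sketched as follows (a sketch only: the helper name `check_psd_eigenvalues` and the tolerances are assumptions here, not the final scikit-learn API):

```python
import warnings
import numpy as np

def check_psd_eigenvalues(w):
    """Sketch: validate eigenvalues of a supposedly PSD matrix.

    Raise ValueError on significant imaginary parts, warn on significant
    negative values, and clip small negatives to zero.
    """
    w = np.asarray(w)
    if np.iscomplexobj(w):
        if np.abs(w.imag).max() > 1e-5 * np.abs(w.real).max():
            raise ValueError("significant imaginary parts in eigenvalues")
        w = w.real
    if w.min() < -1e-5 * w.max():
        warnings.warn("significant negative eigenvalues; the matrix may "
                      "not be PSD", RuntimeWarning)
    return np.clip(w, 0, None)

print(check_psd_eigenvalues([5.0, 1.0, -1e-12]))  # tiny negative clipped to 0
```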
(EDITED: these old benchmarks are not the best ones to see the perf gain. See this post instead.) Here are the results of my benchmark (the x axis is in log scale). Note: since I use random data, the kernel conditioning is quite bad (very small eigenvalues after a few dozen components), but that should not change the execution time comparison.
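For readers unfamiliar with the randomized approach being benchmarked, here is a minimal numpy sketch of the Halko et al. scheme (a simplified sketch, not the PR's implementation, which presumably builds on `sklearn.utils.extmath.randomized_svd`):

```python
import numpy as np

def randomized_eigh(K, n_components, n_oversamples=10, n_iter=4, seed=0):
    """Sketch of a randomized eigendecomposition of a symmetric PSD matrix K:
    project onto a random subspace, run a few power iterations, then solve a
    small dense eigenproblem (Rayleigh-Ritz)."""
    rng = np.random.RandomState(seed)
    Q = rng.randn(K.shape[0], n_components + n_oversamples)
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(K @ Q)  # power iteration with re-orthonormalization
    B = Q.T @ K @ Q                 # small (k+p) x (k+p) problem
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:n_components]
    return w[idx], Q @ V[:, idx]

# quick check against the dense solver on a matrix with a decaying spectrum
rng = np.random.RandomState(0)
Qfull, _ = np.linalg.qr(rng.randn(100, 100))
spectrum = 2.0 ** -np.arange(100)         # fast spectral decay
K = (Qfull * spectrum) @ Qfull.T          # K = Qfull diag(spectrum) Qfull.T
w_rand, _ = randomized_eigh(K, n_components=5)
assert np.allclose(w_rand, spectrum[:5], rtol=1e-6)
```

The cost is dominated by a handful of `n x (k+p)` matrix products, which is why it scales so much better than a full dense decomposition when `n_components << n_samples`.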
…ized are equivalent to full but go faster (for arpack only in some cases). This test can be transformed into a benchmark to reproduce the plots in PR scikit-learn#12069, by simply switching the benchmark_mode flag to True.
Note: I put the benchmark code used to produce the above plots into the test code. A flag allows developers to switch to benchmark mode. Maybe it would be better / more practical to entirely separate the benchmark from the test, and for example to put it in the 'examples' code?
In the meantime I did the experiment that you suggested @ogrisel: here are the projection results and execution times of various approximations (each column corresponds to a different parameter choice), for the legacy version and for the new one (where randomized is used in 'auto' mode when n_components < 0.8 * n_samples); projection images not reproduced here. Note the execution times in parentheses. I have a corresponding sample file ready to push in the "examples" folder, if you think that this is relevant. I have also created a sample file for the previous figures (the "true" benchmark).
This pull request introduces 2 alerts when merging 8d462e1 into 86931ad - view on LGTM.com new alerts:
Comment posted by LGTM.com |
This pull request introduces 2 alerts when merging 6026252 into 4207225 - view on LGTM.com new alerts:
This pull request introduces 2 alerts when merging f366511 into 88b49e5 - view on LGTM.com new alerts:
@smarie , thanks for the equivalence test, can you also address this comment?
Also, how would you explain that the issues in #12143 and #12145 popped up when you were adding the randomized solver? I mean, why are they popping up now and not before? @jnothman, @ogrisel, I find the equivalence test fairly convincing. What do you think of the execution time plot? Is the small time improvement worth it? I just want to make sure this would be a welcome addition before reviewing #12145 and this one in more depth. I've already reviewed #12143 and I think it could/should be merged regardless of this one.
Thanks for the review @NicolasHug.
Just for information, are we supposed to add ourselves to the authors list at the top of the file when we propose such PRs?
We have no strict rules about author lists in files. If you feel you made a substantial contribution to the file and want to be credited and ultimately contacted for your work, you can add your name.
I imagined another experiment to show the benefits of this PR: I generate an increasingly big number of samples (from 100 to 10000), and perform a KernelPCA decomposition to always get the first 100 components. In practice, 100 components are enough to get a fairly good approximation of the underlying distribution for many real-world datasets - even with more data, you should not need many more components to describe it. The execution times speak for themselves (plots omitted; a second run with arpack included). EDIT: another run with arpack:
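A rough sketch of this kind of benchmark (not the PR's benchmark code; sizes, the stand-in PSD matrix, and the component count are assumptions chosen to keep it fast):

```python
import time
import numpy as np
from scipy.sparse.linalg import eigsh

# Fixed number of components, growing number of samples:
# dense solver (all eigenpairs) vs ARPACK (only the top k).
for n in (200, 400, 800):
    rng = np.random.RandomState(0)
    A = rng.randn(n, n)
    K = A @ A.T  # symmetric PSD stand-in for a centered kernel matrix

    t0 = time.perf_counter()
    np.linalg.eigh(K)  # dense: computes all n eigenpairs
    t_dense = time.perf_counter() - t0

    t0 = time.perf_counter()
    w, _ = eigsh(K, k=20, which="LA")  # iterative: only the top 20
    t_arpack = time.perf_counter() - t0

    print(f"n={n}: dense {t_dense:.3f}s, arpack {t_arpack:.3f}s")
```

The gap widens with `n` because the dense solver always pays for the full decomposition regardless of how few components are requested.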
Thanks for the update. Dumb question: those plots are using your current PR's code, not matlab, right? BTW you can include benchmark code (if you deem necessary) with a github gist or by using a |
Yes they are from the PR code of course :)
This pull request introduces 2 alerts when merging 92eeed8 into d4c9e84 - view on LGTM.com new alerts:
Sure: we use it for anomaly detection on sensor data (timeseries). See H. Hoffmann, "Kernel PCA for novelty detection," Pattern Recognition, vol. 40, no. 3, pp. 863-874, 2007. There are only a few bits missing in the scikit-learn implementation for this to run out of the box; in particular, the notion of computing a "distance to normality" is not present in the current implementation. We could definitely think of a pull request about adding these features to kPCA in sklearn, so that it can be added to https://scikit-learn.org/stable/modules/outlier_detection.html#overview-of-outlier-detection-methods. In my mind the current PR was more a "first test" that could lead to other contributions (such as the M5P PR that is pending, or the anomaly detection PR described above). However, the current PR was initiated in Sept 2018 and, well, it took a little bit of time to make it to scikit-learn. Of course this included a huge ramp-up in quality/maturity for me.
Thanks for the feedback, this is very interesting. Maybe @albertcthomas would be interested in the outlier/novelty detection aspects of the potential follow-ups PR to improve |
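To make the "distance to normality" idea concrete: for a linear kernel, Hoffmann's novelty score reduces to the PCA reconstruction error, which can be sketched with plain numpy (a simplification of the paper's kernelized formulation; the data, component count, and threshold-free scoring are assumptions for illustration):

```python
import numpy as np

# Novelty score for centered x_c: ||x_c||^2 - ||V^T x_c||^2, i.e. the squared
# distance between x_c and its projection onto the principal subspace.
rng = np.random.RandomState(0)
X_train = rng.randn(200, 10)

mean = X_train.mean(axis=0)
Xc = X_train - mean
_, V = np.linalg.eigh(Xc.T @ Xc)  # eigenvectors in ascending eigenvalue order
V = V[:, ::-1][:, :3]             # keep the top 3 principal directions

def novelty_score(x):
    xc = x - mean
    proj = V.T @ xc
    return xc @ xc - proj @ proj

inlier = rng.randn(10)
outlier = rng.randn(10) + 10.0  # far from the training distribution
assert novelty_score(outlier) > novelty_score(inlier)
```

The kernelized version replaces the dot products with kernel evaluations, which is exactly the bit described above as missing from the current `KernelPCA`.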
Another idea for a follow-up PR: add `n_oversamples_effective = max(n_oversamples, int(oversample_ratio * n_components))`
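The proposed heuristic in runnable form (the function name and the `oversample_ratio` default are hypothetical; the idea is simply to scale the oversampling with the number of requested components instead of using a fixed floor alone):

```python
def effective_oversamples(n_components, n_oversamples=10, oversample_ratio=0.1):
    # Keep at least the fixed floor, but grow proportionally for large k.
    return max(n_oversamples, int(oversample_ratio * n_components))

print(effective_oversamples(50))   # small k: the fixed floor of 10 dominates
print(effective_oversamples(500))  # large k: the ratio kicks in, giving 50
```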
For the second review, maybe @TomDLT would like to give this PR a look for instance? |
+1, this is exactly what I had in mind at the conclusion of the bench :)! I was thinking of a default value of 1 but you're right, it is better to take some margin here.
Very nice work !
I did not go through the full discussion, but the diff looks great and well tested.
Running the benchmark myself shows some good performance improvement. Good job!
Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
…nto randomized_svd_kPCA (conflicts: doc/whats_new/v1.0.rst, sklearn/decomposition/tests/test_kernel_pca.py)
…t-learn into randomized_svd_kPCA (conflicts: doc/whats_new/v1.0.rst)
All ok for me, one comment left open (args ordering in init).
Tests are failing because of the merge. You can also ignore my comment on args ordering, since positional arguments will stop being allowed in version 1.0.
Thanks @TomDLT!
Merged! Thank you very much @smarie!
From an old comment: |
Hi @lobpcg, sorry for seeing this just now! No, this hasn't been implemented in the merge. However, following the same modifications as in this PR should provide an easy guideline.
Fixes #12068