Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Cythonize dump_svmlight_file #23127

Merged
merged 57 commits into from Jun 16, 2022

Conversation

Micky774
Copy link
Contributor

@Micky774 Micky774 commented Apr 13, 2022

Reference Issues/PRs

Fixes #15527

What does this implement/fix? Explain your changes.

Provides a Cython implementation of _dump_svmlight_file. This implementation provides significant (5-8x) speed-up on dense inputs, but no speedup on sparse inputs.

Any other comments?

While working on this, I realized that while we omit zero-values in the label array y, we also (indirectly) require that if y is sparse then len(y.data)==len(X) since we iterate through them both simultaneously w/ the same loop. This seems odd and potentially erroneous, but I'm not experienced w/ the format so I wanted to see if anyone had any thoughts/clarifications regarding this.

To Do

  • Profile original Python implementation
  • Profile current Cython implementation
  • Consider Cythonizing a small subset of Python implementation
  • Try "naive" Cythonization of Python implementation

@Micky774 Micky774 marked this pull request as draft Apr 13, 2022
@Micky774 Micky774 marked this pull request as ready for review Apr 14, 2022
@Micky774 Micky774 changed the title ENH Cythonize dump svmlight for dense matrices ENH Cythonize dump_svmlight_file Apr 14, 2022
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two optimizations here. One specifically for dense dense and another one for the general case. If we want this easier to review, we can split this into two PRs:

  1. General case.
  2. Dense dense case.

In either case, we need benchmarks comparing general case in this PR with the implementation on main.

sklearn/datasets/_svmlight_format_io.py Outdated Show resolved Hide resolved
@Micky774
Copy link
Contributor Author

Micky774 commented May 1, 2022

For reference, here is the code I'm using to benchmark. Here is the raw data in case you'd like to play around with it.

For the dense/dense case, here is a comparison of their average run time (averaged over 7 trials each):
image

Here is the graph of the improvement ratio (time taken on main divided by time taken on this PR branch).
image

Same graph for a sparse X but dense y. No improvement, in fact seems there's a significant slowdown in some cases (when n_features is large):
image

Indeed, as expected there's only significant improvement when X is dense:
image

Otherwise, if X is sparse, it's generally slightly worse than the current implementation.

@Micky774
Copy link
Contributor Author

After incorporating the changes @thomasjpfan suggested, the cythonized version ~6-8x faster in the dense/dense case, and at worst is now slower than the python implementation, even for the sparse/sparse case. I'll be providing benchmark results soon.

@Micky774
Copy link
Contributor Author

@jeremiedbb @ogrisel Wondering if either of you two would be interested in reviewing this :)

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Micky774, nice to meet you. 👋

Thank you for this contribution. Here are a few comments and questions.

sklearn/datasets/_svmlight_format_io.py Outdated Show resolved Hide resolved
sklearn/datasets/_svmlight_format_fast.pyx Outdated Show resolved Hide resolved
sklearn/datasets/_svmlight_format_fast.pyx Outdated Show resolved Hide resolved
sklearn/datasets/_svmlight_format_fast.pyx Outdated Show resolved Hide resolved
sklearn/datasets/_svmlight_format_fast.pyx Outdated Show resolved Hide resolved
sklearn/datasets/_svmlight_format_fast.pyx Outdated Show resolved Hide resolved
sklearn/datasets/_svmlight_format_fast.pyx Outdated Show resolved Hide resolved
sklearn/datasets/_svmlight_format_io.py Outdated Show resolved Hide resolved
sklearn/datasets/tests/test_svmlight_format.py Outdated Show resolved Hide resolved
sklearn/datasets/_svmlight_format_fast.pyx Outdated Show resolved Hide resolved
sklearn/datasets/_svmlight_format_fast.pyx Outdated Show resolved Hide resolved
sklearn/datasets/_svmlight_format_fast.pyx Outdated Show resolved Hide resolved
sklearn/datasets/_svmlight_format_fast.pyx Outdated Show resolved Hide resolved
sklearn/datasets/_svmlight_format_fast.pyx Outdated Show resolved Hide resolved
sklearn/datasets/_svmlight_format_fast.pyx Outdated Show resolved Hide resolved
sklearn/datasets/_svmlight_format_fast.pyx Outdated Show resolved Hide resolved
sklearn/datasets/tests/test_svmlight_format.py Outdated Show resolved Hide resolved
sklearn/datasets/tests/test_svmlight_format.py Outdated Show resolved Hide resolved
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nits otherwise LGTM

sklearn/datasets/tests/test_svmlight_format.py Outdated Show resolved Hide resolved
sklearn/datasets/tests/test_svmlight_format.py Outdated Show resolved Hide resolved
Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you, @Micky774.

sklearn/datasets/tests/test_svmlight_format.py Outdated Show resolved Hide resolved
@jjerphan jjerphan merged commit 2e3a047 into scikit-learn:main Jun 16, 2022
29 checks passed
@Micky774 Micky774 deleted the cython_dump_svmlight branch Jun 16, 2022
@lesteve
Copy link
Member

lesteve commented Jun 17, 2022

Looks like this PR does not compile with Cython alpha release ... more info in #23668

ogrisel pushed a commit to ogrisel/scikit-learn that referenced this pull request Jul 11, 2022
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
mathijs02 pushed a commit to mathijs02/scikit-learn that referenced this pull request Dec 27, 2022
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dump_svmlight_file is slow and not Cython-optimized
4 participants