New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Cythonize dump_svmlight_file
#23127
Conversation
dump_svmlight_file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two optimizations here. One specifically for dense dense and another one for the general case. If we want this easier to review, we can split this into two PRs:
- General case.
- Dense dense case.
In either case, we need benchmarks comparing general case in this PR with the implementation on main
.
For reference, here is the code I'm using to benchmark. Here is the raw data in case you'd like to play around with it. For the dense/dense case, here is a comparison of their average run time (averaged over 7 trials each): Here is the graph of the improvement ratio (time taken on Same graph for a sparse Indeed, as expected there's only significant improvement when Otherwise, if |
After incorporating the changes @thomasjpfan suggested, the cythonized version ~6-8x faster in the dense/dense case, and at worst is now slower than the python implementation, even for the sparse/sparse case. I'll be providing benchmark results soon. |
…ikit-learn into cython_dump_svmlight
@jeremiedbb @ogrisel Wondering if either of you two would be interested in reviewing this :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Micky774, nice to meet you.
Thank you for this contribution. Here are a few comments and questions.
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nits otherwise LGTM
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
…ikit-learn into cython_dump_svmlight
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you, @Micky774.
Looks like this PR does not compile with Cython alpha release ... more info in #23668 |
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Reference Issues/PRs
Fixes #15527
What does this implement/fix? Explain your changes.
Provides a Cython implementation of
_dump_svmlight_file
. This implementation provides significant (5-8x) speed-up on dense inputs, but no speedup on sparse inputs.Any other comments?
While working on this, I realized that while we omit zero-values in the label array
y
, we also (indirectly) require that ify
is sparse thenlen(y.data)==len(X)
since we iterate through them both simultaneously w/ the same loop. This seems odd and potentially erroneous, but I'm not experienced w/ the format so I wanted to see if anyone had any thoughts/clarifications regarding this.To Do