ENH Allow for appropriate dtype usage in preprocessing.PolynomialFeatures for sparse matrices
#23731
base: main
Conversation
preprocessing.PolynomialFeatures::_csr_polynomial_expansion
preprocessing.PolynomialFeatures
for sparse matrices
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Thank you, @Micky774.
A few comments.
Note that a specific dtype and C type have been added for sparse matrix indices.

scikit-learn/sklearn/utils/_typedefs.pxd (Lines 19 to 28 in b157ac7):

```cython
# scipy matrices indices dtype (namely for indptr and indices arrays)
#
# Note that indices might need to be represented as cnp.int64_t.
# Currently, we use Cython classes which do not handle fused types
# so we hardcode this type to cnp.int32_t, supporting all but edge
# cases.
#
# TODO: support cnp.int64_t for this case
# See: https://github.com/scikit-learn/scikit-learn/issues/23653
ctypedef cnp.int32_t SPARSE_INDEX_TYPE_t
```
Could we propagate this here for semantics?
Not sure how to incorporate this while retaining functionality -- the Cython here requires …
Yes, in retrospect the definition of …
```cython
cdef inline cnp.int64_t _deg3_column(
    cnp.int64_t d,
    cnp.int64_t i,
    cnp.int64_t j,
    cnp.int64_t k,
    FLAG_t interaction_only
) nogil:
```
I think the greatest pain point of the current implementation is that in order to compute the output index we must compute squares or cubes of indices (e.g. `i`, `j`, `k`), and while the output index may fit within `int{32,64}`, the intermediate square/cube calculation may overflow, throwing the entire thing off. This effectively limits the valid range of input indices.

I'm open to any suggestions for how to circumvent this issue.
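To illustrate the failure mode (a standalone sketch, not the PR's code): NumPy's fixed-width integers silently wrap when the cube of an index exceeds the `int64` range, even though the index itself is small.

```python
import numpy as np

# Standalone illustration (not scikit-learn code): the cube of a modest
# index overflows int64, even though the index itself fits comfortably.
i = np.int64(3_000_000)           # 3e6**3 = 2.7e19 > 2**63 - 1
with np.errstate(over="ignore"):  # silence the overflow warning
    wrapped = i * i * i           # wraps around modulo 2**64
exact = int(i) ** 3               # Python ints never overflow
print(int(wrapped) == exact)      # → False: the int64 result is wrong
```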
As NumPy does not define `int128_t`, and does define `npy_int128` to be identical to `int64_t`, I think we could try using `np.uint64`-typed variables (i.e. `unsigned long long int`-typed variables) to compute the squares or cubes of indices.

Alternatively, we could try to rework the formula not to have huge values.
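As a quick sanity check on the headroom this buys (my own arithmetic, assuming the intermediates stay non-negative): `uint64` holds values up to `2**64 - 1`, about 1.4x the `int64` limit for squared indices.

```python
import numpy as np

# Sketch: a square that overflows int64 but still fits in uint64.
i = 3_100_000_000                 # i**2 ≈ 9.61e18
assert i * i > 2**63 - 1          # too big for int64...
sq = np.uint64(i) * np.uint64(i)  # ...but fits in uint64 (< 2**64 - 1)
print(int(sq) == i * i)           # → True: no wraparound
```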
Alright, I ended up predicting when an intermediate overflow would occur, and then deferring to Python in those cases (to make use of arbitrary-precision `PyLong`s). This only occurs when the indices are too large to safely compute the output index, so it won't kick in unless the input data has billions of features in the best case, and ~500k in the worst case. Even when it does kick in, the performance drop affects exactly the elements with too-large indices, which I think is reasonable to assume would be a minority of the data.
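The fallback strategy can be sketched as follows (hypothetical names and helper, not the PR's actual Cython code): take the fast fixed-width path when the cube is provably safe, and defer to Python's arbitrary-precision integers otherwise.

```python
INT64_MAX = 2**63 - 1
# Largest index whose cube is guaranteed to fit in int64:
# (2**21)**3 == 2**63, so anything <= 2**21 - 1 is safe.
SAFE_CUBE_LIMIT = 2**21 - 1

def cube_index(idx: int) -> int:
    """Hypothetical helper: cube an index without silent overflow."""
    if idx <= SAFE_CUBE_LIMIT:
        return idx * idx * idx  # fast path (raw C arithmetic in Cython)
    return pow(idx, 3)          # slow path: Python big ints, exact
```

In pure CPython both branches are exact; the distinction only matters once compiled, where the fast path becomes raw C `int64` multiplication.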
Quick benchmarks:

```python
from sklearn.preprocessing import PolynomialFeatures
from scipy import sparse

X = sparse.random(100_000, 1000, random_state=0)
pf = PolynomialFeatures(interaction_only=False, include_bias=True, degree=2)
%timeit -n9 pf.fit_transform(X)
```

Cython only: 774 ms ± 7.6 ms per loop (mean ± std. dev. of 7 runs, 9 loops each)
PR: 784 ms ± 4.3 ms per loop (mean ± std. dev. of 7 runs, 9 loops each)
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
```python
if (
    sp_version < parse_version("1.9.2")
    and n_features == 65535
    and not interaction_only
):
```
```diff
 if (
     sp_version < parse_version("1.9.2")
-    and n_features == 65535
+    and n_features >= 65535
     and not interaction_only
 ):
```
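For context on where the constant comes from (my own arithmetic, not quoted from the PR): with `degree=2` and `interaction_only=False`, expanding `n` input features yields `n + n*(n+1)//2` non-bias output columns, and `n = 65535` is exactly where that count first exceeds the `int32` range.

```python
# Degree-2, interaction_only=False: n linear columns plus n*(n+1)//2
# degree-2 columns (squares and pairwise products), excluding the bias.
def n_output_cols(n: int) -> int:
    return n + n * (n + 1) // 2

print(n_output_cols(65534) <= 2**31 - 1)  # → True: still fits in int32
print(n_output_cols(65535) <= 2**31 - 1)  # → False: overflows int32
```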
```python
"Due to an error in `scipy.sparse.hstack` present in versions"
" `<1.9.2`, stacking sparse matrices such that the resulting"
" matrix would have `n_cols` too large to be represented by"
" 32bit integers results in negative columns. To avoid this"
" error, either use a version of scipy `>=1.9.2` or alter the"
" `PolynomialFeatures` transformer to produce fewer output"
" features."
```
A suggestion to mention that using `np.int64` for `indices` and `indptr` should work:
```diff
-"Due to an error in `scipy.sparse.hstack` present in versions"
-" `<1.9.2`, stacking sparse matrices such that the resulting"
-" matrix would have `n_cols` too large to be represented by"
-" 32bit integers results in negative columns. To avoid this"
-" error, either use a version of scipy `>=1.9.2` or alter the"
-" `PolynomialFeatures` transformer to produce fewer output"
-" features."
+"Due to a bug in `scipy.sparse.hstack` present in SciPy<1.9.2,"
+" stacking sparse matrices such that the resulting matrix would"
+" have its `n_cols` too large to be represented by 32bit"
+" integers results in negative column indices.\n"
+" To avoid this error, either use `scipy>=1.9.2`, convert your"
+" input matrix `indices` and `indptr` arrays to use `np.int64`"
+" instead of `np.int32`, or alter the `PolynomialFeatures`"
+" transformer to produce fewer output features."
```
Reference Issues/PRs
Fixes #16803
Fixes #17554
Resolves #19676 (stalled)
Resolves #20524 (stalled)
What does this implement/fix? Explain your changes.
PR #20524: Calculates the number of non-zero terms for each degree (row-wise) and creates dense arrays for `data`/`indices`/`indptr` to pass to the Cython `_csr_polynomial_expansion`. Since the size is known a priori, the appropriate dtype can be used during construction. The use of fused types in `_csr_polynomial_expansion` allows only the minimally sufficient index dtype to be used, decreasing wasted memory when `int32` is sufficient.

This PR: reconciles with main and makes minor changes.
Any other comments?
The full functionality of this PR is only enabled for `scipy_version>1.8`, since it depends on an upstream bug fix.