
ENH Allow for appropriate dtype use in preprocessing.PolynomialFeatures for sparse matrices #23731

Open
wants to merge 54 commits into base: main

Conversation

Micky774
Contributor

@Micky774 Micky774 commented Jun 22, 2022

Reference Issues/PRs

Fixes #16803
Fixes #17554
Resolves #19676 (stalled)
Resolves #20524 (stalled)

What does this implement/fix? Explain your changes.

PR #20524: Calculates number of non-zero terms for each degree (row-wise) and creates dense arrays for data/indices/indptr to pass to Cython _csr_polynomial_expansion. Since the size is known a-priori, the appropriate dtype can be used during construction. The use of fused types in _csr_polynomial_expansion allows for only the minimally sufficient index dtype to be used, decreasing wasted memory when int32 is sufficient.
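The counting idea described above can be sketched in plain Python/SciPy. This is a minimal, hypothetical helper (the name `choose_index_dtype` and the exact counting formula are illustrative, not scikit-learn's code): it counts the entries a degree-2 expansion would produce per row, then picks the smallest index dtype that can hold the resulting `indices`/`indptr` arrays.

```python
import numpy as np
from scipy import sparse

def choose_index_dtype(X, interaction_only=False):
    """Pick the smallest index dtype for a degree-2 CSR polynomial expansion.

    Hypothetical sketch: because the number of non-zero output terms per row
    is known a priori, the data/indices/indptr arrays can be preallocated
    with the minimally sufficient dtype instead of defaulting to int64.
    """
    X = sparse.csr_matrix(X)
    nnz_per_row = np.diff(X.indptr).astype(np.int64)
    if interaction_only:
        # pairs of distinct non-zero features: n * (n - 1) / 2
        expanded = nnz_per_row * (nnz_per_row - 1) // 2
    else:
        # combinations with replacement: n * (n + 1) / 2
        expanded = nnz_per_row * (nnz_per_row + 1) // 2
    total_nnz = int(expanded.sum())
    n_output_cols = X.shape[1] * (X.shape[1] + 1) // 2
    # indices must hold column ids; indptr must hold the total nnz
    needs_64 = max(total_nnz, n_output_cols) > np.iinfo(np.int32).max
    return np.int64 if needs_64 else np.int32
```

For typical inputs this returns `np.int32`, which is where the memory saving comes from.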

This PR: reconciles w/ main and makes minor changes.

Any other comments?

The full functionality of this PR is only enabled with SciPy > 1.8, since it depends on an upstream bug fix.

@Micky774 Micky774 changed the title ENH Allow for appropriate dtype use in preprocessing.PolynomialFeatures::_csr_polynomial_expansion ENH Allow for appropriate dtype use in preprocessing.PolynomialFeatures for sparse matrices Jun 22, 2022
Member

@ogrisel ogrisel left a comment

Thanks for the PR. Here is a first pass of feedback.

doc/whats_new/v1.2.rst
sklearn/preprocessing/_csr_polynomial_expansion.pyx
sklearn/preprocessing/_polynomial.py
sklearn/preprocessing/tests/test_polynomial.py
Micky774 and others added 2 commits Jun 29, 2022
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
sklearn/preprocessing/_polynomial.py
sklearn/preprocessing/_csr_polynomial_expansion.pyx
sklearn/preprocessing/tests/test_polynomial.py
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Member

@jjerphan jjerphan left a comment

Thank you, @Micky774.

A few comments.

Note that a specific dtype and C type have been added for sparse matrices indices.

# scipy matrices indices dtype (namely for indptr and indices arrays)
#
# Note that indices might need to be represented as cnp.int64_t.
# Currently, we use Cython classes which do not handle fused types
# so we hardcode this type to cnp.int32_t, supporting all but edge
# cases.
#
# TODO: support cnp.int64_t for this case
# See: https://github.com/scikit-learn/scikit-learn/issues/23653
ctypedef cnp.int32_t SPARSE_INDEX_TYPE_t

Could we propagate this type here for semantic consistency?

sklearn/preprocessing/_csr_polynomial_expansion.pyx
sklearn/preprocessing/tests/test_polynomial.py
@Micky774
Contributor Author

Micky774 commented Aug 18, 2022

Note that a specific dtype and C type have been added for sparse matrices indices.

Not sure how to incorporate this while retaining functionality -- the Cython code here requires {32,64}-bit fused types, and there's no place where a single-type SPARSE_INDEX_TYPE_t would still work. Sorry if I'm getting a bit confused 😅

@jjerphan
Member

jjerphan commented Aug 22, 2022

Not sure how to incorporate this while retaining functionality -- the Cython code here requires {32,64}-bit fused types, and there's no place where a single-type SPARSE_INDEX_TYPE_t would still work. Sorry if I'm getting a bit confused 😅

Yes, in retrospect the definition of SPARSE_INDEX_TYPE_t is confusing to me if we want to handle sparse matrices' index types. As the type and dtype definitions for the Cython implementations need to be reworked, I think it is fine to hard-code cnp.int{32,64}_t directly to support both sparse-matrix cases without using SPARSE_INDEX_TYPE_t.

cdef inline cnp.int64_t _deg3_column(
cnp.int64_t d,
cnp.int64_t i,
cnp.int64_t j,
cnp.int64_t k,
FLAG_t interaction_only
) nogil:
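For intuition about what this helper computes, the mapping can be checked against a brute-force enumeration. This Python sketch is a hypothetical reference implementation (the actual Cython code uses a closed-form expression, not a linear scan): the output column of the term x_i * x_j * x_k is the rank of the sorted triple (i, j, k) among all degree-3 feature triples.

```python
from itertools import combinations, combinations_with_replacement

def deg3_column_bruteforce(d, i, j, k, interaction_only=False):
    """Rank of the index triple (i, j, k), with i <= j <= k, among all
    degree-3 feature triples over d input features.

    Hypothetical reference: enumerates triples in lexicographic order and
    returns the position where (i, j, k) appears.
    """
    gen = combinations if interaction_only else combinations_with_replacement
    for rank, triple in enumerate(gen(range(d), 3)):
        if triple == (i, j, k):
            return rank
    raise ValueError("triple out of range")
```

A closed form must agree with this enumeration for every valid triple, which is what makes overflow in the closed form's intermediate terms so damaging.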
Contributor Author

@Micky774 Micky774 Aug 28, 2022


I think the greatest pain point of the current implementation is that in order to compute the output index we must compute squares or cubes of indices (e.g. i, j, k), and while the output index may fit within int{32,64}, the intermediate square/cube calculation may overflow, throwing the entire computation off. This effectively limits the valid range of input indices.

I'm open to any suggestions for how to circumvent this issue.
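The hazard is easy to reproduce in NumPy: fixed-width integer multiplication wraps silently, so an intermediate square can go negative even when the final index would fit. A minimal illustration (not scikit-learn code):

```python
import numpy as np

# Even when the final output index fits in 32 bits, the intermediate
# square of a column index may not: int32 arithmetic wraps on overflow.
i = np.int32(50_000)               # a plausible column index
with np.errstate(over="ignore"):   # silence NumPy's overflow warning
    wrapped = i * i                # 2_500_000_000 exceeds 2**31 - 1

exact = 50_000 * 50_000            # Python ints are arbitrary precision
```

Here `wrapped` is negative while `exact` is the true value 2,500,000,000, which is exactly the failure mode that corrupts the computed output index.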

Member

@jjerphan jjerphan Aug 29, 2022


As NumPy does not define int128_t (and defines npy_int128 to be identical to int64_t), I think we can try computing the squares or cubes of indices using np.uint64_t-typed variables (which are unsigned long long int variables).

Alternatively, we could try to rework the formula not to have huge values.

Contributor Author

@Micky774 Micky774 Aug 30, 2022


Alright, I ended up predicting when an intermediate overflow would occur and deferring to Python in those cases (to make use of arbitrary-precision PyLongs). This only happens when the indices are too large to safely compute the output index, so it won't kick in unless the input data has billions of features in the best case, and ~500k in the worst case. Even when it does kick in, the performance drop affects exactly the elements with too-large indices, which I think it is reasonable to assume is a minority of the data.
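The fast-path/slow-path strategy can be sketched for the degree-2 case. This is a hypothetical analogue (the function name, the closed form, and the threshold are illustrative, not the PR's exact Cython code): compute the output column of x_i * x_j with 64-bit machine integers when safe, and fall back to Python's arbitrary-precision ints when the intermediate i*i could be unsafe.

```python
import numpy as np

_INT32_MAX = np.iinfo(np.int32).max

def deg2_column(d, i, j):
    """Column of the term x_i * x_j (i <= j) in a degree-2 expansion of
    d input features: rank of (i, j) among pairs with replacement.

    Hypothetical sketch of the overflow-aware dispatch: the threshold
    below is illustrative, not the exact bound the PR derives.
    """
    if i > _INT32_MAX:
        # slow path: plain Python ints cannot overflow
        d, i, j = int(d), int(i), int(j)
        return d * i - i * (i + 1) // 2 + j
    # fast path: the same closed form with 64-bit machine integers
    d, i, j = np.int64(d), np.int64(i), np.int64(j)
    return int(d * i - i * (i + 1) // 2 + j)
```

Both paths evaluate the same formula, so the only cost of the fallback is per-element Python arithmetic on the rare oversized indices.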

Quick benchmarks:
from sklearn.preprocessing import PolynomialFeatures
from scipy import sparse

X = sparse.random(100_000, 1000, random_state=0)
pf = PolynomialFeatures(interaction_only=False, include_bias=True, degree=2)
%timeit -n9 pf.fit_transform(X)
Cython only: 774 ms ± 7.6 ms per loop (mean ± std. dev. of 7 runs, 9 loops each)
PR:          784 ms ± 4.3 ms per loop (mean ± std. dev. of 7 runs, 9 loops each)

@Micky774
Contributor Author

Micky774 commented Aug 30, 2022

@ogrisel @jjerphan This should be ready for review again if you'd be interested.

Member

@jjerphan jjerphan left a comment

Thank you, @Micky774.

Here is a first review.

Potentially this PR can be split into two PRs:

  • a first one for the fix for hstack
  • a second one for the fix for preprocessing.PolynomialFeatures.

sklearn/preprocessing/_csr_hstack.pyx
sklearn/preprocessing/_csr_polynomial_expansion.pyx
sklearn/preprocessing/setup.py
sklearn/preprocessing/_polynomial.py
Member

@jjerphan jjerphan left a comment

LGTM when the CI is green.

Since the scope of the PR has changed, I think this can be renamed to, for instance:

FIX Error in `preprocessing.PolynomialFeatures` when the output's `indices` and `indptr` are not representable

if (
sp_version < parse_version("1.9.2")
and n_features == 65535
and not interaction_only
):
Member

@jjerphan jjerphan Oct 20, 2022


Suggested change
if (
sp_version < parse_version("1.9.2")
and n_features == 65535
and not interaction_only
):
if (
sp_version < parse_version("1.9.2")
and n_features >= 65535
and not interaction_only
):

"Due to an error in `scipy.sparse.hstack` present in versions"
" `<1.9.2`, stacking sparse matrices such that the resulting"
" matrix would have `n_cols` too large to be represented by"
" 32bit integers results in negative columns. To avoid this"
" error, either use a version of scipy `>=1.9.2` or alter the"
" `PolynomialFeatures` transformer to produce fewer output"
" features."
Member

@jjerphan jjerphan Oct 20, 2022


A suggestion to mention that using np.int64 for indices and indptr should work

Suggested change
"Due to an error in `scipy.sparse.hstack` present in versions"
" `<1.9.2`, stacking sparse matrices such that the resulting"
" matrix would have `n_cols` too large to be represented by"
" 32bit integers results in negative columns. To avoid this"
" error, either use a version of scipy `>=1.9.2` or alter the"
" `PolynomialFeatures` transformer to produce fewer output"
" features."
"Due to a bug in `scipy.sparse.hstack` present in SciPy<1.9.2,"
" stacking sparse matrices such that the resulting matrix would"
" have its `n_cols` too large to be represented by 32bit"
" integers results in negative column indices.\n"
" To avoid this error, either use `scipy>=1.9.2`, convert your"
" input matrix's `indices` and `indptr` arrays to use `np.int64`"
" instead of `np.int32`, or alter the `PolynomialFeatures`"
" transformer to produce fewer output features."
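The workaround the suggested message points at can be sketched as follows. This is an illustrative helper (the name `safe_hstack` and the upcasting strategy are assumptions, not scikit-learn's or SciPy's code): when the stacked result would have more columns than int32 can represent, upcast each block's index arrays to int64 before stacking so the concatenated column indices cannot wrap to negative values.

```python
import numpy as np
from scipy import sparse

def safe_hstack(blocks):
    """Horizontally stack CSR blocks, upcasting index arrays to int64
    when the combined column count exceeds the int32 range.

    Illustrative sketch of the scipy<1.9.2 workaround; modifies the
    blocks' index arrays in place before delegating to scipy.
    """
    total_cols = sum(b.shape[1] for b in blocks)
    if total_cols > np.iinfo(np.int32).max:
        for b in blocks:
            b.indices = b.indices.astype(np.int64)
            b.indptr = b.indptr.astype(np.int64)
    return sparse.hstack(blocks, format="csr")
```

With `scipy>=1.9.2` the upcast branch is unnecessary, since hstack promotes the index dtype itself when needed.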
