ENH Allow for appropriate dtype usage in preprocessing.PolynomialFeatures for sparse matrices
#23731
base: main
Conversation
preprocessing.PolynomialFeatures::_csr_polynomial_expansion
preprocessing.PolynomialFeatures
for sparse matrices
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Thank you, @Micky774.
A few comments.
Note that a specific dtype and C type have been added for sparse matrix indices.

scikit-learn/sklearn/utils/_typedefs.pxd (Lines 19 to 28 in b157ac7):

```cython
# scipy matrices indices dtype (namely for indptr and indices arrays)
#
# Note that indices might need to be represented as cnp.int64_t.
# Currently, we use Cython classes which do not handle fused types
# so we hardcode this type to cnp.int32_t, supporting all but edge
# cases.
#
# TODO: support cnp.int64_t for this case
# See: https://github.com/scikit-learn/scikit-learn/issues/23653
ctypedef cnp.int32_t SPARSE_INDEX_TYPE_t
```
Could we propagate this here for semantics?
Not sure how to incorporate this while retaining functionality -- the Cython here requires …
Yes, in retrospect the definition of …
```cython
cdef inline cnp.int64_t _deg3_column(
    cnp.int64_t d,
    cnp.int64_t i,
    cnp.int64_t j,
    cnp.int64_t k,
    FLAG_t interaction_only
) nogil:
```
I think the greatest pain point of the current implementation is that in order to compute the output index we must compute squares or cubes of indices (e.g. `i`, `j`, `k`), and while the output index may fit within `int{32,64}`, the intermediate square/cube calculation may overflow, throwing the entire thing off. This effectively limits the valid range of input indices.

I'm open to any suggestions for how to circumvent this issue.
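To illustrate the failure mode (a standalone sketch, not the PR's code): NumPy's fixed-width integers silently wrap when the cube of an index exceeds the `int64` range, even though the index itself is small.

```python
import numpy as np

# Standalone illustration (not scikit-learn code): the cube of a modest
# index overflows int64, even though the index itself fits comfortably.
i = np.int64(3_000_000)           # 3e6**3 = 2.7e19 > 2**63 - 1
with np.errstate(over="ignore"):  # silence the overflow warning
    wrapped = i * i * i           # wraps around modulo 2**64
exact = int(i) ** 3               # Python ints never overflow
print(int(wrapped) == exact)      # → False: the int64 result is wrong
```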
As NumPy does not define `int128_t`, and does define `npy_int128` to be identical to `int64_t`, I think we could try using `np.uint64`-typed variables (i.e. `unsigned long long int`-typed variables) to compute the squares or cubes of indices.

Alternatively, we could try to rework the formula not to have huge values.
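As a quick sanity check on the headroom this buys (my own arithmetic, assuming the intermediates stay non-negative): `uint64` holds values up to `2**64 - 1`, about 1.4x the `int64` limit for squared indices.

```python
import numpy as np

# Sketch: a square that overflows int64 but still fits in uint64.
i = 3_100_000_000                 # i**2 ≈ 9.61e18
assert i * i > 2**63 - 1          # too big for int64...
sq = np.uint64(i) * np.uint64(i)  # ...but fits in uint64 (< 2**64 - 1)
print(int(sq) == i * i)           # → True: no wraparound
```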
Alright, I ended up predicting when an intermediate overflow would occur, and then deferring to Python in those cases (to make use of arbitrary-precision `PyLong`s). This only occurs when the indices are too large to safely compute the output index, so it won't kick in unless the input data has billions of features in the best case, and ~500k in the worst case. Even when it does kick in, the performance drop affects exactly the elements with too-large indices, which I think is reasonable to assume would be a minority of the data.
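The fallback strategy can be sketched as follows (hypothetical names and helper, not the PR's actual Cython code): take the fast fixed-width path when the cube is provably safe, and defer to Python's arbitrary-precision integers otherwise.

```python
INT64_MAX = 2**63 - 1
# Largest index whose cube is guaranteed to fit in int64:
# (2**21)**3 == 2**63, so anything <= 2**21 - 1 is safe.
SAFE_CUBE_LIMIT = 2**21 - 1

def cube_index(idx: int) -> int:
    """Hypothetical helper: cube an index without silent overflow."""
    if idx <= SAFE_CUBE_LIMIT:
        return idx * idx * idx  # fast path (raw C arithmetic in Cython)
    return pow(idx, 3)          # slow path: Python big ints, exact
```

In pure CPython both branches are exact; the distinction only matters once compiled, where the fast path becomes raw C `int64` multiplication.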
Quick benchmarks:

```python
from sklearn.preprocessing import PolynomialFeatures
from scipy import sparse

X = sparse.random(100_000, 1000, random_state=0)
pf = PolynomialFeatures(interaction_only=False, include_bias=True, degree=2)
%timeit -n9 pf.fit_transform(X)
```

Cython only: 774 ms ± 7.6 ms per loop (mean ± std. dev. of 7 runs, 9 loops each)
PR: 784 ms ± 4.3 ms per loop (mean ± std. dev. of 7 runs, 9 loops each)
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
```python
if (
    sp_version < parse_version("1.9.2")
    and n_features == 65535
    and not interaction_only
):
```
```diff
 if (
     sp_version < parse_version("1.9.2")
-    and n_features == 65535
+    and n_features >= 65535
     and not interaction_only
 ):
```
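For context on where the constant comes from (my own arithmetic, not quoted from the PR): with `degree=2` and `interaction_only=False`, expanding `n` input features yields `n + n*(n+1)//2` non-bias output columns, and `n = 65535` is exactly where that count first exceeds the `int32` range.

```python
# Degree-2, interaction_only=False: n linear columns plus n*(n+1)//2
# degree-2 columns (squares and pairwise products), excluding the bias.
def n_output_cols(n: int) -> int:
    return n + n * (n + 1) // 2

print(n_output_cols(65534) <= 2**31 - 1)  # → True: still fits in int32
print(n_output_cols(65535) <= 2**31 - 1)  # → False: overflows int32
```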
```python
"Due to an error in `scipy.sparse.hstack` present in versions"
" `<1.9.2`, stacking sparse matrices such that the resulting"
" matrix would have `n_cols` too large to be represented by"
" 32bit integers results in negative columns. To avoid this"
" error, either use a version of scipy `>=1.9.2` or alter the"
" `PolynomialFeatures` transformer to produce fewer output"
" features."
```
A suggestion to mention that using `np.int64` for `indices` and `indptr` should work:
```diff
-"Due to an error in `scipy.sparse.hstack` present in versions"
-" `<1.9.2`, stacking sparse matrices such that the resulting"
-" matrix would have `n_cols` too large to be represented by"
-" 32bit integers results in negative columns. To avoid this"
-" error, either use a version of scipy `>=1.9.2` or alter the"
-" `PolynomialFeatures` transformer to produce fewer output"
-" features."
+"Due to a bug in `scipy.sparse.hstack` present in SciPy<1.9.2,"
+" stacking sparse matrices such that the resulting matrix would"
+" have its `n_cols` too large to be represented by 32bit"
+" integers results in negative column indices.\n"
+" To avoid this error, either use `scipy>=1.9.2`, convert your"
+" input matrix `indices` and `indptr` arrays to use `np.int64`"
+" instead of `np.int32`, or alter the `PolynomialFeatures`"
+" transformer to produce fewer output features."
```
Reference Issues/PRs
Fixes #16803
Fixes #17554
Resolves #19676 (stalled)
Resolves #20524 (stalled)
What does this implement/fix? Explain your changes.
PR #20524: Calculates the number of non-zero terms for each degree (row-wise) and creates dense arrays for `data`/`indices`/`indptr` to pass to the Cython `_csr_polynomial_expansion`. Since the size is known a priori, the appropriate dtype can be used during construction. The use of fused types in `_csr_polynomial_expansion` allows only the minimally sufficient index dtype to be used, decreasing wasted memory when `int32` is sufficient.

This PR: reconciles with main and makes minor changes.
Any other comments?
The full functionality of this PR is only enabled for `scipy_version>1.8`, since it depends on an upstream bug fix.