ENH Perform KNN imputation without O(n^2) memory cost #16397
Conversation
Fixes #15604. This is more computationally expensive than the previous implementation, but should reduce memory costs substantially in common use cases.
the uncovered lines are uncovered in master...
When will this be available? How could I try this new function before merging?
You can pull this branch into your local working copy... Or try running a code snippet similar to #15604 (comment):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

calhousing = fetch_california_housing()
X = pd.DataFrame(calhousing.data, columns=calhousing.feature_names)
y = pd.Series(calhousing.target, name='house_value')

rng = np.random.RandomState(42)
density = 4  # one in 4 values will be NaN
mask = rng.randint(density, size=X.shape) == 0
X_na = X.copy()
X_na.values[mask] = np.nan
X_na = StandardScaler().fit_transform(X_na)

knn = KNNImputer()
```

This PR:

```python
%%memit
knn.fit_transform(X_na)
# peak memory: 3468.01 MiB, increment: 3345.06 MiB
```

Master:

```python
%%memit
knn.fit_transform(X_na)
# peak memory: 6371.18 MiB, increment: 6245.66 MiB
```
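The memory saving comes from computing the pairwise distance matrix in fixed-size row chunks rather than materializing all n×n distances at once (scikit-learn exposes this pattern via `pairwise_distances_chunked`). A minimal numpy sketch of the idea follows; the function name and `chunk_size` are illustrative, not the actual KNNImputer internals:

```python
import numpy as np

def pairwise_dist_chunked(X, chunk_size=128):
    """Yield the pairwise Euclidean distance matrix in row blocks,
    so peak memory is O(chunk_size * n) instead of O(n^2)."""
    n = X.shape[0]
    for start in range(0, n, chunk_size):
        chunk = X[start:start + chunk_size]
        # Only one (chunk_size, n) block of distances exists at a time.
        d2 = ((chunk[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        yield np.sqrt(d2)

rng = np.random.RandomState(0)
X = rng.rand(500, 8)

# Full matrix computed in one shot, for comparison only.
full = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
chunked = np.vstack(list(pairwise_dist_chunked(X, chunk_size=128)))
print(np.allclose(full, chunked))  # True
```

Each chunk can be reduced (e.g. to the k nearest donors per row) and discarded before the next one is computed, which is what keeps the peak footprint bounded.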
I was facing the same memory error and the imputer kept crashing. Thanks to this PR and sharing the link to pull this into my local copy, I was able to move forward in my project.
LGTM. I am just not sure about the warning: would it be fine to filter it, to make it obvious that we are expecting it in this test?
Uhm, I thought the codecov error was weird: https://codecov.io/gh/scikit-learn/scikit-learn/compare/0c4252cc52ccb4f150e2e7564f40ff8af83f47cc...a6af8242801d6aeef3ba61c0021fce2556579609/diff#D3-272
Oh, these lines were not covered by the test before as well. So still LGTM.
@thomasjpfan In which case is it that we will reach this part of the code? |
@jnothman Do you want me to push the small changes if you have limited time? They are only nitpicks, which I am able to do :)
LGTM. @thomasjpfan, do you want to have a final look? I think this is good to be merged.
LGTM |
Merged 244d118 into scikit-learn:master

Thanks for the reviews!
* FIX ensure object array are properly casted when dtype=object (#16076)
* DOC Docstring example of classifier should import classifier (#16430)
* MNT Update nightly build URL and release staging config (#16435)
* BUG ensure that estimator_name is properly stored in the ROC display (#16500)
* BUG ensure that name is properly stored in the precision/recall display (#16505)
* ENH Perform KNN imputation without O(n^2) memory cost (#16397)
* bump scikit-learn version for binder
* bump version to 0.22.2
* MNT Skips failing SpectralCoclustering doctest (#16232)
* TST Updates test for deprecation in pandas.SparseArray (#16040)
* move 0.22.2 what's new entries (#16586)
* add 0.22.2 in the news of the web site frontpage
* skip test_ard_accuracy_on_easy_problem

Co-authored-by: alexshacked <al.shacked@gmail.com>
Co-authored-by: Oleksandr Pavlyk <oleksandr-pavlyk@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Sorry for duplicating your effort, @thomasjpfan, if you had already attempted this.
The KNNImputer is a pretty difficult piece of code to work with.
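For readers unfamiliar with the algorithm, here is a minimal numpy sketch of nan-aware KNN imputation for a single missing value. The helper names are illustrative only; this is not scikit-learn's actual implementation, which uses `nan_euclidean_distances` and supports distance weighting:

```python
import numpy as np

def nan_euclidean(a, b):
    """Euclidean distance over coordinates observed in both vectors,
    rescaled to compensate for the missing ones."""
    mask = ~(np.isnan(a) | np.isnan(b))
    if not mask.any():
        return np.inf
    d2 = ((a[mask] - b[mask]) ** 2).sum()
    return np.sqrt(d2 * len(a) / mask.sum())

def knn_impute_value(X, row, col, k=2):
    """Impute X[row, col] as the mean of that column over the k nearest
    rows (by nan-euclidean distance) where the column is observed."""
    donors = [i for i in range(len(X))
              if i != row and not np.isnan(X[i, col])]
    donors.sort(key=lambda i: nan_euclidean(X[row], X[i]))
    return np.mean([X[i, col] for i in donors[:k]])

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0],
              [8.0, 8.0]])
print(knn_impute_value(X, 2, 0, k=2))  # 5.5 (mean of rows 1 and 3 in column 0)
```

Computed naively over all rows at once, the donor search is where the O(n^2) distance matrix arises; this PR's contribution is to bound that intermediate.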