Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Perform KNN imputation without O(n^2) memory cost #16397

Merged
merged 6 commits into from Feb 24, 2020

Conversation

@jnothman
Copy link
Member

@jnothman jnothman commented Feb 6, 2020

Fixes #15604

This is more computationally expensive than the previous implementation,
but should reduce memory costs substantially in common use cases.

Sorry for duplicating your effort, @thomasjpfan, if you had already attempted this.

The KNNImputer is a pretty difficult piece of code to work with.

jnothman added 2 commits Feb 6, 2020
Fixes #15604

This is more computationally expensive than the previous implementation,
but should reduce memory costs substantially in common use cases.
@jnothman
Copy link
Member Author

@jnothman jnothman commented Feb 6, 2020

the uncovered lines are uncovered in master...

@ajing
Copy link

@ajing ajing commented Feb 6, 2020

When will this be available? How could I try this new function before merging?

@jnothman
Copy link
Member Author

@jnothman jnothman commented Feb 6, 2020

When will this be available? How could I try this new function before merging?

You can pull this branch into your local working copy... Or try pip install https://github.com/jnothman/scikit-learn/archive/knnimpute-memory.zip

@thomasjpfan thomasjpfan self-requested a review Feb 7, 2020
@thomasjpfan
Copy link
Member

@thomasjpfan thomasjpfan commented Feb 13, 2020

By running the code snippet similar to #15604 (comment)

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer
import pandas as pd

calhousing = fetch_california_housing()

X = pd.DataFrame(calhousing.data, columns=calhousing.feature_names)
y = pd.Series(calhousing.target, name='house_value')

rng = np.random.RandomState(42)

density = 4  # one in 10 values will be NaN

mask = rng.randint(density, size=X.shape) == 0
X_na = X.copy()
X_na.values[mask] = np.nan
X_na = StandardScaler().fit_transform(X_na)

knn = KNNImputer()

This PR

%%memit
knn.fit_transform(X_na)
# peak memory: 3468.01 MiB, increment: 3345.06 MiB

Master

%%memit
knn.fit_transform(X_na)
# peak memory: 6371.18 MiB, increment: 6245.66 MiB

sklearn/impute/tests/test_knn.py Show resolved Hide resolved
sklearn/metrics/pairwise.py Outdated Show resolved Hide resolved
@impiyush
Copy link

@impiyush impiyush commented Feb 20, 2020

When will this be available? How could I try this new function before merging?

You can pull this branch into your local working copy... Or try pip install https://github.com/jnothman/scikit-learn/archive/knnimpute-memory.zip

I was facing the same memory error and the imputer kept crashing. Thanks to this PR and sharing the link to pull this into my local copy, I was able to move forward in my project.

@glemaitre glemaitre added this to TO REVIEW in Guillaume's pet Feb 20, 2020
@glemaitre glemaitre self-requested a review Feb 20, 2020
Copy link
Contributor

@glemaitre glemaitre left a comment

LGTM. I am just not sure about the warning if it could be fine to filter it to make it obvious that we are expecting it for this test?

sklearn/impute/_knn.py Show resolved Hide resolved
sklearn/metrics/pairwise.py Outdated Show resolved Hide resolved
sklearn/impute/tests/test_knn.py Show resolved Hide resolved
doc/whats_new/v0.23.rst Outdated Show resolved Hide resolved
@glemaitre
Copy link
Contributor

@glemaitre glemaitre commented Feb 20, 2020

Oh these lines were not covered by the test before as well. So still LGTM.
@thomasjpfan we should probably add a test case if possible to cover these lines. This is independent from this PR.

@glemaitre
Copy link
Contributor

@glemaitre glemaitre commented Feb 20, 2020

@thomasjpfan In which case is it that we will reach this part of the code?

@glemaitre glemaitre moved this from TO REVIEW to REVIEWED AND WAITING FOR CHANGES in Guillaume's pet Feb 20, 2020
@glemaitre glemaitre moved this from REVIEWED AND WAITING FOR CHANGES to GETTING STALLED in Guillaume's pet Feb 20, 2020
@glemaitre glemaitre moved this from GETTING STALLED to TO BE MERGED in Guillaume's pet Feb 20, 2020
@glemaitre glemaitre self-assigned this Feb 21, 2020
@glemaitre
Copy link
Contributor

@glemaitre glemaitre commented Feb 21, 2020

@jnothman Do you want me to push the small changes if you have limited time? They are only nitpicking which I am able to do :)

@glemaitre
Copy link
Contributor

@glemaitre glemaitre commented Feb 23, 2020

LGTM @thomasjpfan do you want to have a final look. I think this good to be merged.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

LGTM

@thomasjpfan thomasjpfan merged commit 244d118 into scikit-learn:master Feb 24, 2020
15 of 16 checks passed
15 of 16 checks passed
@codecov
codecov/patch 96.29% of diff hit (target 97.90%)
Details
@lgtm-com
LGTM analysis: C/C++ No code changes detected
Details
@lgtm-com
LGTM analysis: JavaScript No code changes detected
Details
@lgtm-com
LGTM analysis: Python No new or fixed alerts
Details
@codecov
codecov/project 97.90% (+0.00%) compared to 17e6c6b
Details
@azure-pipelines
scikit-learn.scikit-learn Build #20200223.2 succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (Linting) Linting succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (Linux py36_conda_openblas) Linux py36_conda_openblas succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (Linux py36_ubuntu_atlas) Linux py36_ubuntu_atlas succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (Linux pylatest_pip_openblas_pandas) Linux pylatest_pip_openblas_pandas succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (Linux32 py36_ubuntu_atlas_32bit) Linux32 py36_ubuntu_atlas_32bit succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (Linux_Runs pylatest_conda_mkl) Linux_Runs pylatest_conda_mkl succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (Windows py36_pip_openblas_32bit) Windows py36_pip_openblas_32bit succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (Windows py37_conda_mkl) Windows py37_conda_mkl succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (macOS pylatest_conda_mkl) macOS pylatest_conda_mkl succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (macOS pylatest_conda_mkl_no_openmp) macOS pylatest_conda_mkl_no_openmp succeeded
Details
@glemaitre glemaitre moved this from TO BE MERGED to MERGED in Guillaume's pet Feb 24, 2020
@jnothman
Copy link
Member Author

@jnothman jnothman commented Feb 24, 2020

@jnothman jnothman added this to the 0.22.2 milestone Feb 24, 2020
jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Feb 28, 2020
ogrisel added a commit that referenced this pull request Feb 28, 2020
* FIX ensure object array are properly casted when dtype=object (#16076)

* DOC Docstring example of classifier should import classifier (#16430)

* MNT Update nightly build URL and release staging config (#16435)

* BUG ensure that estimator_name is properly stored in the ROC display (#16500)

* BUG ensure that name is properly stored in the precision/recall display (#16505)

* ENH Perform KNN imputation without O(n^2) memory cost (#16397)

* bump scikit-learn version for binder

* bump version to 0.22.2

* MNT Skips failing SpectralCoclustering doctest (#16232)

* TST Updates test for deprecation in pandas.SparseArray (#16040)

* move 0.22.2 what's new entries (#16586)

* add 0.22.2 in the news of the web site frontpage

* skip test_ard_accuracy_on_easy_problem

Co-authored-by: alexshacked <al.shacked@gmail.com>
Co-authored-by: Oleksandr Pavlyk <oleksandr-pavlyk@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
gio8tisu added a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Linked issues

Successfully merging this pull request may close these issues.

6 participants