ENH Perform KNN imputation without O(n^2) memory cost #16397
Conversation
Fixes #15604. This is more computationally expensive than the previous implementation, but should reduce memory costs substantially in common use cases.
the uncovered lines are uncovered in master...
When will this be available? How could I try this new function before merging?
You can pull this branch into your local working copy... Or try running a code snippet similar to #15604 (comment):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

calhousing = fetch_california_housing()
X = pd.DataFrame(calhousing.data, columns=calhousing.feature_names)
y = pd.Series(calhousing.target, name='house_value')

rng = np.random.RandomState(42)
density = 4  # one in 4 values will be NaN
mask = rng.randint(density, size=X.shape) == 0
X_na = X.copy()
X_na.values[mask] = np.nan
X_na = StandardScaler().fit_transform(X_na)

knn = KNNImputer()
```

This PR:

```python
%%memit
knn.fit_transform(X_na)
# peak memory: 3468.01 MiB, increment: 3345.06 MiB
```

Master:

```python
%%memit
knn.fit_transform(X_na)
# peak memory: 6371.18 MiB, increment: 6245.66 MiB
```
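The memory saving comes from computing the pairwise distance matrix in fixed-size row chunks rather than materializing all n×n distances at once (scikit-learn exposes this pattern via `pairwise_distances_chunked`). A minimal numpy sketch of the idea follows; the function name and `chunk_size` are illustrative, not the actual KNNImputer internals:

```python
import numpy as np

def pairwise_dist_chunked(X, chunk_size=128):
    """Yield the pairwise Euclidean distance matrix in row blocks,
    so peak memory is O(chunk_size * n) instead of O(n^2)."""
    n = X.shape[0]
    for start in range(0, n, chunk_size):
        chunk = X[start:start + chunk_size]
        # Only one (chunk_size, n) block of distances exists at a time.
        d2 = ((chunk[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        yield np.sqrt(d2)

rng = np.random.RandomState(0)
X = rng.rand(500, 8)

# Full matrix computed in one shot, for comparison only.
full = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
chunked = np.vstack(list(pairwise_dist_chunked(X, chunk_size=128)))
print(np.allclose(full, chunked))  # True
```

Each chunk can be reduced (e.g. to the k nearest donors per row) and discarded before the next one is computed, which is what keeps the peak footprint bounded.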
I was facing the same memory error and the imputer kept crashing. Thanks to this PR and sharing the link to pull this into my local copy, I was able to move forward in my project.
LGTM. I am just not sure about the warning: would it be fine to filter it, to make it obvious that we are expecting it in this test?
Uhm, I thought the codecov error was weird: https://codecov.io/gh/scikit-learn/scikit-learn/compare/0c4252cc52ccb4f150e2e7564f40ff8af83f47cc...a6af8242801d6aeef3ba61c0021fce2556579609/diff#D3-272
Oh, these lines were not covered by the test before as well. So still LGTM.
@thomasjpfan In which case is it that we will reach this part of the code? |
@jnothman Do you want me to push the small changes if you have limited time? They are only nitpicks, which I am able to do :)
LGTM. @thomasjpfan, do you want to have a final look? I think this is good to be merged.
LGTM |
Merged 244d118 into scikit-learn:master

Thanks for the reviews!
* FIX ensure object array are properly casted when dtype=object (#16076)
* DOC Docstring example of classifier should import classifier (#16430)
* MNT Update nightly build URL and release staging config (#16435)
* BUG ensure that estimator_name is properly stored in the ROC display (#16500)
* BUG ensure that name is properly stored in the precision/recall display (#16505)
* ENH Perform KNN imputation without O(n^2) memory cost (#16397)
* bump scikit-learn version for binder
* bump version to 0.22.2
* MNT Skips failing SpectralCoclustering doctest (#16232)
* TST Updates test for deprecation in pandas.SparseArray (#16040)
* move 0.22.2 what's new entries (#16586)
* add 0.22.2 in the news of the web site frontpage
* skip test_ard_accuracy_on_easy_problem

Co-authored-by: alexshacked <al.shacked@gmail.com>
Co-authored-by: Oleksandr Pavlyk <oleksandr-pavlyk@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Sorry for duplicating your effort, @thomasjpfan, if you had already attempted this.
The KNNImputer is a pretty difficult piece of code to work with.
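For readers unfamiliar with the algorithm, here is a minimal numpy sketch of nan-aware KNN imputation for a single missing value. The helper names are illustrative only; this is not scikit-learn's actual implementation, which uses `nan_euclidean_distances` and supports distance weighting:

```python
import numpy as np

def nan_euclidean(a, b):
    """Euclidean distance over coordinates observed in both vectors,
    rescaled to compensate for the missing ones."""
    mask = ~(np.isnan(a) | np.isnan(b))
    if not mask.any():
        return np.inf
    d2 = ((a[mask] - b[mask]) ** 2).sum()
    return np.sqrt(d2 * len(a) / mask.sum())

def knn_impute_value(X, row, col, k=2):
    """Impute X[row, col] as the mean of that column over the k nearest
    rows (by nan-euclidean distance) where the column is observed."""
    donors = [i for i in range(len(X))
              if i != row and not np.isnan(X[i, col])]
    donors.sort(key=lambda i: nan_euclidean(X[row], X[i]))
    return np.mean([X[i, col] for i in donors[:k]])

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0],
              [8.0, 8.0]])
print(knn_impute_value(X, 2, 0, k=2))  # 5.5 (mean of rows 1 and 3 in column 0)
```

Computed naively over all rows at once, the donor search is where the O(n^2) distance matrix arises; this PR's contribution is to bound that intermediate.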