ENH Adds TargetEncoder #25334
Conversation
Do we want to have one encoder that handles both classification and regression?
To have an encoder that works for both classification and regression, I think we'd have to detect the type of problem based on the values of the target.

I think the way this works is that when performing the change from "categories" to "encoded categories".

After reading the code and example once, my big picture thoughts are:
What is your plan? Is it ready to go or does it need more tweaking?
What can the
From the dev meeting, we thought that we could put multiclass aside for the moment and support binary classification and regression. We need to make sure that the name of the encoder reflects that, but we don't have to support all possible classification and regression problems at first. I will make a review having those points in mind.
Since
I agree. Having one encoder for all types of problems is nicer than having to choose. My question was "Why not take the dirty_cat implementation of `TargetEncoder`?"
Here are the key differences between this PR and dirty_cat's version:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import TargetRegressorEncoder
from dirty_cat import TargetEncoder as DirtyCatTargetEncoder
import numpy as np
rng = np.random.default_rng()
n_samples, n_features = 500_000, 20
X = rng.integers(0, high=30, size=(n_samples, n_features))
y = rng.standard_normal(size=n_samples)

%%timeit
_ = TargetRegressorEncoder().fit_transform(X, y)
# 3.37 s ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
_ = DirtyCatTargetEncoder().fit_transform(X, y)
# 9.9 s ± 220 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR is faster even when it's
It's mostly to decide on how we want to extend the API for classification targets. Currently, this PR is the minimum requirement for regression targets. The core computation in this PR can be extended to classification without too much trouble.
I do not like how `type_of_target` infers the target type in some cases:

import numpy as np
from sklearn.utils.multiclass import type_of_target
type_of_target(np.asarray([1.0] * 10 + [2.0] * 30 + [4.0] * 10 + [5.0]))
# 'multiclass'

I prefer two more explicit options:
I went with option 1 in this PR, but I am okay with either option. For option 2, I am +0.5 on having a
After thinking about it a little more, I am okay with just inferring the target type with `type_of_target`.
I pushed 26d2429 to add a statistical non-regression test that ensures that the nested cross-validation in `fit_transform` works as intended.

I am still not decided whether this should better be a pitfall-style example or a test. I think having it in a test makes it less likely to introduce a change in this encoder that would cause a silent regression. But maybe the existing tests are enough. I am curious to hear your feedback.
smooth : "auto" or float, default="auto"
    The amount of mixing of the categorical encoding with the global target mean. A
    larger `smooth` value will put more weight on the global target mean.
    If `"auto"`, then `smooth` is set to an empirical Bayes estimate.
I think we should allow for an `n_features`-length parametrization here (e.g. a list and/or a dict with feature names as keys): the optimal smoothing for one feature is not necessarily optimal for others.

This can be done as a follow-up PR though. It feels border-line YAGNI to me. In the meantime it is possible to use a `ColumnTransformer` to configure per-feature `TargetEncoder` instances with dedicated `smooth` values, as sketched below.
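A rough sketch of that workaround, with made-up column names and `smooth` values:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import TargetEncoder

# Hypothetical column names; each group of columns gets its own smoothing.
preprocessor = ColumnTransformer(
    [
        ("low_cardinality", TargetEncoder(smooth=5.0), ["country"]),
        ("high_cardinality", TargetEncoder(smooth=50.0), ["zip_code"]),
    ],
    remainder="passthrough",
)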
Looks good to me. Based on `test_target_encoding_for_linear_regression` I think we need to do the nested cross-val by default and break the usual `fit_transform` <=> `fit` + `transform` implicit equivalence of other scikit-learn estimators. In particular, I don't see how to compute the "real" training accuracy: to do so we would need a `fit_score` method on pipelines (which could be a good idea by the way to save some redundant computation, but this is a digression).

Anyways, I don't see any other way around it, and to me the protection against catastrophic overfitting caused by noisy high-cardinality categorical features outweighs the potentially surprising (but well documented) behavior of `fit_transform`.
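To make the documented behavior concrete, here is a small check (assuming the merged estimator keeps the default internal cross fitting) that `fit_transform` does not match `fit` followed by `transform` on the same training data:

import numpy as np
from sklearn.preprocessing import TargetEncoder

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(1000, 1))   # one categorical feature with 10 levels
y = rng.standard_normal(1000)

enc = TargetEncoder()
X_cv = enc.fit_transform(X, y)            # out-of-fold encodings (internal CV)
X_full = enc.fit(X, y).transform(X)       # encodings learned on all the data

print(np.allclose(X_cv, X_full))          # expected: False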
I think both are useful. I have not seen an example similar to your test case that demonstrates why the internal validation is useful. In a follow-up PR, we can convert the test into a pitfall-style example and link it in the docstring.
Spotted another typo in the inline comment of the new test.
@lorentzenchr @betatim @glemaitre any more feedback? @jovan-stojanovic you might be interested in the new test: I checked that dirty_cat's
@thomasjpfan I had a devil-inspired idea at coffee: we could store a weakref to the training set at fit time to detect if the data passed to `transform` is the training set itself. Still, the weakref hack could lead to surprising behaviors. For instance, while the following would work:

X_train = load_dataset_from_disk()
X_trans_1 = target_transformer.fit_transform(X_train)
X_trans_2 = target_transformer.transform(X_train)
np.testing.assert_allclose(X_trans_1, X_trans_2)

this seemingly innocuous variation would fail:

X_train = load_dataset_from_disk()
X_trans_1 = target_transformer.fit_transform(X_train)
X_train = load_dataset_from_disk()
X_trans_2 = target_transformer.transform(X_train)
np.testing.assert_allclose(X_trans_1, X_trans_2)

so overall, I am not 100% sure whether the weakref hack would be a usability improvement or not. Feel free to pretend that you haven't read this comment and not reply. I would perfectly understand.
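For what it's worth, a toy sketch of just the detection part of that idea (hypothetical, not part of this PR), which also shows the failure mode above: a weak reference only recognizes the exact same array object, so reloading or copying the data breaks it:

import weakref
import numpy as np

class TrainingSetDetector:
    # Hypothetical helper: remember the training array without keeping it alive.
    def fit(self, X):
        self._train_ref = weakref.ref(X)
        return self

    def is_training_set(self, X):
        return self._train_ref() is X

X_train = np.random.randn(20, 3)
detector = TrainingSetDetector().fit(X_train)
print(detector.is_training_set(X_train))         # True: same object
print(detector.is_training_set(X_train.copy()))  # False: a copy is a different object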
Another pitfall I discovered when experimenting with this PR: if you have a mix of informative and non-informative categorical features (e.g. `f_i` and `f_u` in the snippet below). However if you use the raw target encoded values of those features. I see two possible solutions:
preprocessor = ColumnTransformer(
    [
        (
            "categorical",
            # Note: shared_mean / shared_scale are options proposed in this
            # comment; StandardScaler does not have these parameters today.
            make_pipeline(TargetEncoder(), StandardScaler(shared_mean=True, shared_scale=True)),
            ["f_i", "f_u"],
        ),
    ],
    remainder=StandardScaler(),
)

Option

EDIT: we should probably do

Even if we decide would also be

I just wanted to brain-dump this here so that we can think about it when we work on a pitfall example for this encoder.

/cc @jovan-stojanovic who might also be interested for dirty_cat.
I think it borders on being too magical. For example, if the data is sliced the same way or copied, the references are not the same:

import numpy as np
import weakref
X = np.random.randn(10, 10)
X1 = X[:4]
X2 = X[:4]
X3 = X1.copy()
X1_ref = weakref.ref(X1)
assert X1 is X1_ref()
assert X2 is not X1_ref()
assert X3 is not X1_ref()

For reference, cuML's TargetEncoder holds the training data and checks all the values.
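For comparison, a toy sketch of the value-based alternative (what cuML's TargetEncoder reportedly does): keep a copy of the training data and compare values at transform time, which survives copies but costs memory and a full comparison:

import numpy as np

class ValueCheckDetector:
    # Hypothetical helper illustrating a value-based training-set check.
    def fit(self, X):
        self._X_train = np.array(X, copy=True)
        return self

    def is_training_set(self, X):
        X = np.asarray(X)
        return X.shape == self._X_train.shape and np.array_equal(X, self._X_train)

X_train = np.random.randn(10, 4)
detector = ValueCheckDetector().fit(X_train)
print(detector.is_training_set(X_train.copy()))  # True: same values, different object
print(detector.is_training_set(X_train[:5]))     # False: different shape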
At one point, I had something similar implemented in #17323 as the default. I think it's reasonable to use a scaled version of the target for encoding purposes.
Yes, some MNIST pixels reach a value of 250 after rescaling (see Details below), which is quite out of distribution for a standard normal distribution. We should add a warning when some outputs of
Yes, this is a big limitation of

import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
mnist = fetch_openml("mnist_784", as_frame=False, parser="pandas")
X, y = mnist.data, mnist.target
X_scaled = StandardScaler().fit_transform(X)
max_values = X_scaled.max(axis=0)
fig, ax = plt.subplots()
image = ax.imshow(max_values.reshape(28, 28), cmap=plt.get_cmap("viridis", 6),
norm=LogNorm())
ax.set(xticks=[], yticks=[], title="Maximum value of each scaled MNIST feature")
fig.colorbar(image)
plt.show()
Let's keep that in mind for a follow-up PR. But if we want to make it the default (which would probably be helpful), we should probably do that before the 1.3 release.
Note that we could use a weakref + a concrete value check. But even that would feel too complex/magical. +0.5 for keeping the code as it is.
I have only one nitpick.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Merged! Thank you very much @thomasjpfan!
Just an idea for detecting the training set during `transform`:
The checks need to happen before encoding on the categorical variables. We could store the per-feature category counts instead. Maybe with a few random probe records that contain several features with infrequent categories.
But this would be quite catastrophic in case of false positives.
Now I see the difficulty. Maybe it is good enough as is. In principle, we would need to detect every single row of the training set, and that's the responsibility of the user, isn't it?
Whoop whoop! Nice work!
Pretty nice addition, thanks for this. A small question: according to the estimator tags, `requires_y` is reported as `False`:

>>> from sklearn.preprocessing import TargetEncoder
>>> from sklearn.utils._tags import _safe_tags
>>> _safe_tags(TargetEncoder())['requires_y']
False
Reference Issues/PRs
Closes #5853
Closes #9614
Supersedes #17323
Fixes or at least related to #24967
What does this implement/fix? Explain your changes.
This PR implements a target encoder which uses CV during `fit_transform` to prevent the target from leaking. `transform` uses the target encoding learned from all the training data. This means that `fit_transform()` != `fit().transform()`.

The implementation uses Cython to learn the encoding, which provides a 10x speed up compared to a pure Python+NumPy approach. Cython is required because many encodings are learned during cross validation in `fit_transform`.

Any other comments?

The implementation uses the same scheme as cuML's TargetEncoder, which they used to win RecSys 2020.
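For readers unfamiliar with the scheme, here is a rough pure-NumPy sketch of cross-fitted target encoding for a single feature. It is conceptual only, not the PR's Cython implementation, and the smoothing uses a simple m-estimate rather than the exact formula:

import numpy as np
from sklearn.model_selection import KFold

def cross_fit_target_encode(x, y, n_splits=5, smooth=10.0):
    # Each sample is encoded with statistics computed on the *other* folds,
    # so its own target value never leaks into its encoding.
    x, y = np.asarray(x), np.asarray(y, dtype=float)
    encoded = np.empty_like(y)
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=0).split(x):
        x_tr, y_tr = x[train_idx], y[train_idx]
        fold_mean = y_tr.mean()
        for cat in np.unique(x[test_idx]):
            mask = x_tr == cat
            n = mask.sum()
            cat_mean = y_tr[mask].mean() if n else fold_mean
            encoded[test_idx[x[test_idx] == cat]] = (
                (n * cat_mean + smooth * fold_mean) / (n + smooth)
            )
    return encoded

x = np.repeat(["a", "b", "c", "d"], 25)
y = np.random.default_rng(0).standard_normal(x.shape[0])
print(cross_fit_target_encode(x, y)[:5])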