
ENH Adds TargetEncoder #25334

Merged
merged 91 commits into scikit-learn:main on Mar 16, 2023

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Jan 8, 2023

Reference Issues/PRs

Closes #5853
Closes #9614
Supersedes #17323
Fixes, or is at least related to, #24967

What does this implement/fix? Explain your changes.

This PR implements a target encoder which uses CV during fit_transform to prevent the target from leaking. transform uses the target encoding learned from all the training data. This means that fit_transform() != fit().transform().

The implementation uses Cython to learn the encoding, which provides a 10x speedup compared to a pure Python+NumPy approach. Cython is required because many encodings are learned during cross validation in fit_transform.
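A rough sketch of that cross-fitting scheme in plain Python (a simplified illustration only; the actual implementation is in Cython and also shrinks the per-category means toward the global mean):

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_target_encode(X_col, y, n_splits=5):
    # Simplified sketch of cross-fitted target encoding for a single
    # categorical column (no smoothing toward the global mean).
    X_col = np.asarray(X_col)
    y = np.asarray(y, dtype=float)
    encoded = np.empty_like(y)
    global_mean = y.mean()
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X_col):
        # Learn per-category means on the training split only ...
        means = {}
        for cat in np.unique(X_col[train_idx]):
            means[cat] = y[train_idx][X_col[train_idx] == cat].mean()
        # ... and apply them to the held-out split, so a sample's own
        # target value never contributes to its encoding.
        for i in test_idx:
            encoded[i] = means.get(X_col[i], global_mean)
    return encoded

X_col = np.array(["a", "b", "a", "b", "a", "b", "a", "b", "a", "b"])
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
print(cv_target_encode(X_col, y))
```

Each sample is encoded with category means learned on the other folds, which is the leakage protection described above.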

Any other comments?

The implementation uses the same scheme as cuML's TargetEncoder, which they used to win Recsys2020.

@thomasjpfan thomasjpfan changed the title ENH Adds Target Regression Encoder ENH Adds Target Regressor Encoder Jan 8, 2023
@glemaitre
Contributor

Do we want to have Regressor in the name, since we could make the encoder work on both regression and classification problems?

@betatim
Member

betatim commented Jan 9, 2023

Do we want to have Regressor in the name, since we could make the encoder work on both regression and classification problems?

To have an encoder that works for both classification and regression I think we'd have to detect the type of problem based on the values of y. How good is the automatic detection of "regression vs classification"? In addition, it feels like the code for classification, in particular multi-class classification, would be more complicated than for the regression case (you can't use np.mean(y)).


I think the way this works is that, when performing the change from "categories" to "encoded categories", _BaseEncoder._transform returns a version of X where each column representing a categorical feature is replaced by a column of integers. This is then translated to the actual encoded values in _transform_X_int. On top of that we have the complexity of fit().transform() and fit_transform() not doing exactly the same thing. It took a lot of going back and forth and working out what _BaseEncoder does to get to that understanding.
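A rough illustration of that two-step mapping (hypothetical values; the real logic lives in private _BaseEncoder helpers):

```python
import numpy as np

# Illustrative sketch of the two-step mapping described above
# (made-up categories and encodings, not the actual private API).
categories = ["low", "mid", "high"]          # learned during fit
encodings = np.array([10.0, 20.0, 30.0])     # one encoded value per category

X_col = ["mid", "low", "high", "mid"]

# Step 1 (like _BaseEncoder._transform): replace each category by its index
cat_to_int = {cat: i for i, cat in enumerate(categories)}
X_int = np.array([cat_to_int[v] for v in X_col])
# X_int == [1, 0, 2, 1]

# Step 2 (like _transform_X_int): look up the encoded value for each index
X_encoded = encodings[X_int]
print(X_encoded)  # [20. 10. 30. 20.]
```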

After reading the code and example once my big picture thoughts are:

  • nice!
  • this looks like it should work,
  • can we have more doc strings and comments that explain why something is the way it is
  • both doc string and naming stuff should be done after code structure, etc
  • right now sample weights aren't supported, would it be easy to add that?

What is your plan? Is it ready to go or does it need more tweaking?

@glemaitre
Contributor

How good is the automatic detection of "regression vs classification"?

type_of_target is quite good and would do the job.

In addition it feels like the code for classification, in particular multi-class classification, would be more complicated than for the regression case (you can't use np.mean(y)).

dirty_cat implements both but this is not the same branch of code: https://github.com/dirty-cat/dirty_cat/blob/main/dirty_cat/_target_encoder.py

@betatim
Member

betatim commented Jan 9, 2023

What can the TargetEncoder from dirty_cat not do / what is the advantage of making yet another implementation over taking it verbatim? For example, from its constructor args, it seems it supports regression, binary and multi-class classification. It seems like an excellent starting point.

@glemaitre
Contributor

From the dev meeting: we thought we could put multiclass aside for the moment and support binary classification and regression. We need to make sure that the name of the encoder reflects that, but we don't have to support all possible classification and regression problems at first.

I will review with those points in mind.

@glemaitre
Contributor

What can the TargetEncoder from dirty cat not do/what is the advantage of making yet another implementation over taking it verbatim?

Since OneHotEncoder and OrdinalEncoder work for all problems, it could come as a surprise to our users to have to select the right encoder type. There is also the precedent of categorical_encoder, cuml, and dirty_cat (the ones that I am aware of), which expose a single encoder for all usages.

@betatim
Member

betatim commented Jan 9, 2023

Since OneHotEncoder and OrdinalEncoder work for all problems, it could come as a surprise to our users to have to select the right encoder type.

I agree. Having one encoder for all types of problems is nicer than having to choose.

My question was "Why not take the dirty_cat implementation of TargetEncoder (as a starting point) for the implementation in scikit-learn?"

@thomasjpfan
Member Author

thomasjpfan commented Jan 9, 2023

Here are the key differences between this PR and dirty_cat's version:

  1. This PR takes advantage of the existing code in _BaseEncoder, which greatly simplifies the implementation.
  2. This PR does cross validation in fit_transform while dirty_cat does not. The cross validation is required to prevent the target from leaking during training.
  3. This PR's implementation is much faster because of the Cython, even with cross validation in fit_transform:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import TargetRegressorEncoder
from dirty_cat import TargetEncoder as DirtyCatTargetEncoder
import numpy as np

rng = np.random.default_rng()
n_samples, n_features = 500_000, 20
X = rng.integers(0, high=30, size=(n_samples, n_features))
y = rng.standard_normal(size=n_samples)

%%timeit
_ = TargetRegressorEncoder().fit_transform(X, y)
# 3.37 s ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
_ = DirtyCatTargetEncoder().fit_transform(X, y)
# 9.9 s ± 220 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR is faster even though its fit_transform learns 6 encodings: five for cv=5 and one more over the whole dataset. dirty_cat only learns one encoding over the whole dataset because it does not do cross validation.

@thomasjpfan
Member Author

thomasjpfan commented Jan 10, 2023

What is your plan? Is it ready to go or does it need more tweaking?

It mostly remains to decide how we want to extend the API for classification targets. Currently, this PR is the minimum required for regression targets. The core computation in this PR can be extended to classification without too much trouble.

type_of_target is quite good and would do the job.

I do not like how type_of_target treats floats as classification targets. For example:

import numpy as np
from sklearn.utils.multiclass import type_of_target

type_of_target(np.asarray([1.0] * 10 + [2.0] * 30 + [4.0] * 10 + [5.0]))
# 'multiclass'

I prefer two more explicit options:

  1. TargetRegressorEncoder and TargetClassificationEncoder
  2. TargetEncoder with a target_type parameter to switch between regression and classification.

I went with option 1 in this PR, but I am okay with either option. For option 2, I am +0.5 on having a target_type="auto" option that will infer the type of the target.

@thomasjpfan thomasjpfan changed the title ENH Adds Target Regressor Encoder ENH Adds TargetEncoder Jan 10, 2023
@thomasjpfan
Member Author

thomasjpfan commented Jan 10, 2023

After thinking about it a little more, I am okay with just inferring the target type with type_of_target. I updated this PR:

  1. Infer the target type with type_of_target
  2. Added a target_type="auto" option to allow the user to control the inference.
  3. Support both binary classification and regression
  4. The encoder is now called TargetEncoder

@ogrisel
Member

ogrisel commented Mar 14, 2023

I pushed 26d2429 to add a statistical non-regression test that ensures that the nested cross-validation in fit_transform is actually useful to prevent overfitting in a minimal synthetic dataset that is hopefully not too artificial but reflects situations that can happen in practice. Note that the empirical Bayes shrinkage alone is not enough to prevent the pathological overfitting highlighted in this test case.

I am still not decided whether this would be better as a pitfall-style example or a test. I think having it in a test makes it less likely that a change to this encoder introduces a silent regression. But maybe the existing tests are enough.

I am curious to hear your feedback.

smooth : "auto" or float, default="auto"
The amount of mixing of the categorical encoding with the global target mean. A
larger `smooth` value will put more weight on the global target mean.
If `"auto"`, then `smooth` is set to an empirical Bayes estimate.
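As a rough illustration of the mixing described in this docstring (a simplified sketch for a float `smooth`, not the estimator's exact code):

```python
import numpy as np

def smoothed_encoding(y_cat, y_global_mean, smooth):
    # Shrink the per-category mean toward the global mean; a larger
    # `smooth` pulls harder toward the global mean (illustrative formula).
    n = len(y_cat)
    return (np.sum(y_cat) + smooth * y_global_mean) / (n + smooth)

y_cat = np.array([1.0, 1.0, 1.0])   # small category with local mean 1.0
print(smoothed_encoding(y_cat, y_global_mean=0.0, smooth=0.0))   # 1.0
print(smoothed_encoding(y_cat, y_global_mean=0.0, smooth=30.0))  # ~0.09
```

With `smooth=0` the encoding is the raw category mean; as `smooth` grows, small categories are pulled toward the global target mean.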
Member

I think we should allow an n_features-length parametrization here (e.g. a list and/or a dict with feature names as keys): the optimal smoothing for one feature is not necessarily optimal for others.

This can be done in a follow-up PR though. It feels borderline YAGNI to me. In the meantime, it is possible to use a ColumnTransformer to configure per-feature TargetEncoder instances with dedicated smooth values.

Member

@ogrisel ogrisel left a comment


Looks good to me. Based on test_target_encoding_for_linear_regression I think we need to do the nested cross-val by default and break the usual fit_transform <=> fit + transform implicit equivalence of other scikit-learn estimators. In particular, I don't see how to compute the "real" training accuracy: to do so we would need a fit_score method on pipelines (which could be a good idea by the way to save some redundant computation, but this is a digression).

Anyway, I don't see any other way around it, and to me the protection against catastrophic overfitting caused by noisy high-cardinality categorical features outweighs the potentially surprising (but well documented) behavior of fit_transform.

@thomasjpfan
Member Author

I am still not decided whether this would be better as a pitfall-style example or a test.

I think both are useful. I have not seen an example similar to your test case that demonstrates why the internal validation is useful.

In a follow up PR, we can convert the test into a pitfall style example and link it in the docstring for cv. (I'm still unsure about adding cv=None)

Member

@ogrisel ogrisel left a comment


Spotted another typo in the inline comment of the new test.

sklearn/preprocessing/tests/test_target_encoder.py (two resolved review threads)
@ogrisel
Member

ogrisel commented Mar 14, 2023

@lorentzenchr @betatim @glemaitre any more feedback?

@jovan-stojanovic you might be interested in the new test: I checked that dirty_cat's TargetEncoder can also lead to this kind of overfitting problem because of the lack of built-in CV during fit_transform.

@ogrisel
Member

ogrisel commented Mar 15, 2023

@thomasjpfan I had a devil-inspired idea at coffee: we could store a weakref to the training set at fit time to detect whether the X passed to .transform is the same as the one passed to .fit/.fit_transform. That would make it possible to hide the fit_transform / fit + transform discrepancy like cuML does, but without holding a strong reference to X_train, which would be problematic: it would prevent early collection of otherwise GC-able data and would make model serialization inefficient without additional care.

Still, the weakref hack could lead to surprising behaviors. For instance, while the following would work:

X_train = load_dataset_from_disk()
X_trans_1 = target_transformer.fit_transform(X_train)
X_trans_2 = target_transformer.transform(X_train)
np.testing.assert_allclose(X_trans_1, X_trans_2)

this seemingly innocuous variation would fail:

X_train = load_dataset_from_disk()
X_trans_1 = target_transformer.fit_transform(X_train)

X_train = load_dataset_from_disk()
X_trans_2 = target_transformer.transform(X_train)
np.testing.assert_allclose(X_trans_1, X_trans_2)

so overall, I am not 100% sure whether the weakref hack would be a usability improvement.

Feel free to pretend that you haven't read this comment and not reply. I would perfectly understand.

@ogrisel
Member

ogrisel commented Mar 15, 2023

Another pitfall I discovered when experimenting with this PR:

If you have a mix of informative and non-informative categorical features (e.g. f_i and f_u), and if you apply StandardScaler() after TargetEncoder(), then you can run into big trouble: the variance of target_encoder.fit_transform(X_train[[f_u]]) can be very small but non-zero, so StandardScaler can rescale it massively, which is probably a bad idea for a downstream distance-based model (e.g. a kernel machine, nearest neighbors and so on).

However, if you use the raw target-encoded values of f_i and f_u in conjunction with true numerical features that have all been scaled (via a dedicated entry in a ColumnTransformer), and if y has a much larger standard deviation, then f_i will have a very large standard deviation compared to the independently scaled numerical features, which is problematic.

I see two possible solutions:

  • a) add an option to TargetEncoder to internally use a scaled version of y_train to compute the encodings;
  • b) make it possible for StandardScaler or those scalers to use shared mean and scale statistics for a group of columns:
preprocessor = ColumnTransformer(
    [
        (
            "categorical",
            make_pipeline(TargetEncoder(), StandardScaler(shared_mean=True, shared_scale=True)),
            ["f_i", "f_u"],
        ),
    ],
    remainder=StandardScaler(),
)

Option b) is probably useful beyond post-processing target encoded categorical values. For instance @TomDLT and I encountered a similar problem a long time ago when running make_pipeline(StandardScaler(), LogisticRegression()) with the saga solver on MNIST: some pixel values are nearly constant 0 or 1, and per-feature scaling catastrophically hurts the condition number of X.T @ X if I recall correctly. At the time we just decided to use MinMaxScaler instead and move along. But this is a real problem when groups of features should be rescaled together to preserve meaning (e.g. pixel intensities, EEG channels, MEL features for audio signals and so on).

EDIT: we should probably do StandardScaler(shared_mean_and_scale=True) instead, because decoupling the two might be challenging from a code-maintenance point of view, and not that useful to end users.
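For what it's worth, the shared-scaling behavior proposed above could be prototyped with a small custom transformer (a hypothetical sketch; SharedStandardScaler is not part of scikit-learn):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class SharedStandardScaler(TransformerMixin, BaseEstimator):
    """Hypothetical sketch: one mean and one scale shared across all
    columns of the group, instead of per-feature statistics."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean()   # single scalar over all entries
        self.scale_ = X.std()   # single scalar over all entries
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) / self.scale_

X = np.array([[0.0, 10.0], [2.0, 12.0], [4.0, 14.0]])
X_scaled = SharedStandardScaler().fit_transform(X)
```

The relative magnitudes of the grouped columns are preserved, unlike with per-feature standardization.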

Even if we decide that a) would also be a convenient UX improvement, we can delay it to a follow-up PR.

I just wanted to brain-dump this here so that we can think about it when we work on a pitfall example for TargetEncoder-based feature engineering.

/cc @jovan-stojanovic who might also be interested for dirty_cat.

@thomasjpfan
Member Author

so overall, I am not 100% sure the weakref hack would be a usability improvement or not.

I think it borders on being too magical. For example, if the data is sliced the same way or copied, the references are not the same:

import numpy as np
import weakref

X = np.random.randn(10, 10)

X1 = X[:4]
X2 = X[:4]
X3 = X1.copy()

X1_ref = weakref.ref(X1)

assert X1 is X1_ref()
assert X2 is not X1_ref()
assert X3 is not X1_ref()

For reference, cuML's TargetEncoder holds the training data and checks all the values.

a) add an option to TargetEncoder to internally use a scaled version of y_train to compute the encodings;

At one point, I had something similar implemented in #17323 as the default. I think it's reasonable to use a scaled version of the target for encoding purposes.

@TomDLT
Member

TomDLT commented Mar 15, 2023

For instance @TomDLT and I encountered a similar problem a long time ago when running make_pipeline(StandardScaler(), LogisticRegression()) with the saga solver on MNIST: some pixel values are nearly constant 0 or 1 and having per-feature scaling is catastrophic.

Yes, some MNIST pixels reach a value of 250 after rescaling (see the snippet below), which is quite out of distribution for a unit-variance feature. We should add a warning when some outputs of StandardScaler are large (say above 100, or even 10).

This is a real problem when groups of features should be rescaled together to preserve meaning (e.g. pixel intensities, EEG channels, MEL features for audio signals and so on).

Yes, this is a big limitation of StandardScaler. +1 for adding StandardScaler(shared_mean=True, shared_scale=True).

(Figure: maximum value of each scaled MNIST feature; generated by the snippet below.)

import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler

mnist = fetch_openml("mnist_784", as_frame=False, parser="pandas")
X, y = mnist.data, mnist.target

X_scaled = StandardScaler().fit_transform(X)
max_values = X_scaled.max(axis=0)

fig, ax = plt.subplots()
image = ax.imshow(max_values.reshape(28, 28), cmap=plt.get_cmap("viridis", 6),
                  norm=LogNorm())
ax.set(xticks=[], yticks=[], title="Maximum value of each scaled MNIST feature")
fig.colorbar(image)
plt.show()

@ogrisel
Member

ogrisel commented Mar 16, 2023

a) add an option to TargetEncoder to internally use a scaled version of y_train to compute the encodings;

At one point, I had something similar implemented in #17323 as the default. I think it's reasonable to use a scaled version of the target for encoding purposes.

Let's keep that in mind for a follow-up PR. But if we want to make it the default (which would probably be helpful), we should do that before the 1.3 release.

@ogrisel
Member

ogrisel commented Mar 16, 2023

For reference, cuML's TargetEncoder holds the training data and checks all the values.

Note that we could use a weakref plus a concrete value check. But even that would feel too complex/magical. +0.5 for keeping the code as it is.

@glemaitre glemaitre self-requested a review March 16, 2023 12:53
Contributor

@glemaitre glemaitre left a comment


I have only one nitpick.

examples/preprocessing/plot_target_encoder.py (resolved review thread)
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@ogrisel ogrisel merged commit 392fdee into scikit-learn:main Mar 16, 2023
24 checks passed
@ogrisel
Member

ogrisel commented Mar 16, 2023

Merged! Thank you very much @thomasjpfan!

@lorentzenchr
Member

Just an idea: for detecting the training set during transform, we could store the mean, the variance, and maybe 2 quantiles. This combination should be pretty unique.

@ogrisel
Member

ogrisel commented Mar 17, 2023

The checks would need to happen before encoding, on the categorical variables. We could store the per-feature category counts instead, maybe with a few random probe records that contain several features with infrequent categories.
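The per-feature category-count fingerprint could look roughly like this (a hypothetical sketch, not part of the PR):

```python
from collections import Counter

def fingerprint(X_columns):
    # One Counter of category frequencies per feature column
    # (hypothetical helper for illustration).
    return [Counter(col) for col in X_columns]

X_train_cols = [["a", "b", "a", "c"], ["x", "x", "y", "z"]]
X_other_cols = [["a", "b", "b", "c"], ["x", "x", "y", "z"]]

fp_train = fingerprint(X_train_cols)
print(fingerprint(X_train_cols) == fp_train)   # True: likely the training set
print(fingerprint(X_other_cols) == fp_train)   # False: different data
```

Two different datasets can of course share the same counts, so this would only be a heuristic.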

@ogrisel
Member

ogrisel commented Mar 17, 2023

The checks would need to happen before encoding, on the categorical variables. We could store the per-feature category counts instead. Maybe with a few random probe records.

But this would be quite catastrophic in case of false positives.

@lorentzenchr
Member

Now I see the difficulty. Maybe it is good enough as is. In principle, we would need to detect every single row of the training set, and that's the responsibility of the user, isn't it?

@betatim
Member

betatim commented Mar 17, 2023

Whoop whoop! Nice work!

@BenjaminBossan
Contributor

BenjaminBossan commented Mar 20, 2023

Pretty nice addition, thanks for this.

A small question: according to the tags, TargetEncoder does not require y; should this be changed?

>>> from sklearn.preprocessing import TargetEncoder
>>> from sklearn.utils._tags import _safe_tags
>>> _safe_tags(TargetEncoder())['requires_y']
False

Development

Successfully merging this pull request may close these issues.

CountFeaturizer for categorical data