
ENH Adds TargetEncoder #25334

Merged
merged 91 commits into scikit-learn:main on Mar 16, 2023

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Jan 8, 2023

Reference Issues/PRs

Closes #5853
Closes #9614
Supersedes #17323
Fixes, or is at least related to, #24967

What does this implement/fix? Explain your changes.

This PR implements a target encoder which uses CV during fit_transform to prevent the target from leaking. transform uses the target encoding learned from all the training data. This means that fit_transform() != fit().transform().

The implementation uses Cython to learn the encoding, which provides a 10x speedup compared to a pure Python+NumPy approach. Cython is required because many encodings are learned during cross validation in fit_transform.
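A rough sketch of that cross-fitting scheme in plain Python (a simplified illustration only; the actual implementation is in Cython and also shrinks the per-category means toward the global mean):

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_target_encode(X_col, y, n_splits=5):
    # Simplified sketch of cross-fitted target encoding for a single
    # categorical column (no smoothing toward the global mean).
    X_col = np.asarray(X_col)
    y = np.asarray(y, dtype=float)
    encoded = np.empty_like(y)
    global_mean = y.mean()
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X_col):
        # Learn per-category means on the training split only ...
        means = {}
        for cat in np.unique(X_col[train_idx]):
            means[cat] = y[train_idx][X_col[train_idx] == cat].mean()
        # ... and apply them to the held-out split, so a sample's own
        # target value never contributes to its encoding.
        for i in test_idx:
            encoded[i] = means.get(X_col[i], global_mean)
    return encoded

X_col = np.array(["a", "b", "a", "b", "a", "b", "a", "b", "a", "b"])
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
print(cv_target_encode(X_col, y))
```

Each sample is encoded with category means learned on the other folds, which is the leakage protection described above.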

Any other comments?

The implementation uses the same scheme as cuML's TargetEncoder, which they used to win Recsys2020.

@thomasjpfan thomasjpfan changed the title ENH Adds Target Regression Encoder ENH Adds Target Regressor Encoder Jan 8, 2023
@glemaitre
Contributor

Do we want to have Regressor in the name, since we could make the encoder work on both regression and classification problems?

@betatim
Member

betatim commented Jan 9, 2023

Do we want to have Regressor in the name, since we could make the encoder work on both regression and classification problems?

To have an encoder that works for both classification and regression I think we'd have to detect the type of problem based on the values of y. How good is the automatic detection of "regression vs classification"? In addition, it feels like the code for classification, in particular multi-class classification, would be more complicated than for the regression case (you can't use np.mean(y)).


I think the way this works is that, when performing the change from "categories" to "encoded categories", _BaseEncoder._transform returns a version of X where each column representing a categorical feature is replaced by a column of integers. This is then translated to the actual encoded values in _transform_X_int. On top of that we have the complexity of fit().transform() and fit_transform() not doing exactly the same thing. It took a lot of going back and forth and working out what _BaseEncoder does to get to that understanding.
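A rough illustration of that two-step mapping (hypothetical values; the real logic lives in private _BaseEncoder helpers):

```python
import numpy as np

# Illustrative sketch of the two-step mapping described above
# (made-up categories and encodings, not the actual private API).
categories = ["low", "mid", "high"]          # learned during fit
encodings = np.array([10.0, 20.0, 30.0])     # one encoded value per category

X_col = ["mid", "low", "high", "mid"]

# Step 1 (like _BaseEncoder._transform): replace each category by its index
cat_to_int = {cat: i for i, cat in enumerate(categories)}
X_int = np.array([cat_to_int[v] for v in X_col])
# X_int == [1, 0, 2, 1]

# Step 2 (like _transform_X_int): look up the encoded value for each index
X_encoded = encodings[X_int]
print(X_encoded)  # [20. 10. 30. 20.]
```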

After reading the code and example once my big picture thoughts are:

  • nice!
  • this looks like it should work,
  • can we have more doc strings and comments that explain why something is the way it is
  • both doc string and naming stuff should be done after code structure, etc
  • right now sample weights aren't supported, would it be easy to add that?

What is your plan? Is it ready to go or does it need more tweaking?

@glemaitre
Contributor

How good is the automatic detection of "regression vs classification"?

type_of_target is quite good and would do the job.

In addition it feels like the code for classification, in particular multi-class classification, would be more complicated than for the regression case (you can't use np.mean(y)).

dirty_cat implements both but this is not the same branch of code: https://github.com/dirty-cat/dirty_cat/blob/main/dirty_cat/_target_encoder.py

@betatim
Member

betatim commented Jan 9, 2023

What can the TargetEncoder from dirty_cat not do / what is the advantage of making yet another implementation over taking it verbatim? For example, from its constructor args, it seems it supports regression, binary and multi-class classification. It seems like an excellent starting point.

@glemaitre
Contributor

From the dev meeting: we thought we could put multiclass aside for the moment and support binary classification and regression. We need to make sure that the name of the encoder reflects that, but we don't have to support all possible classification and regression problems at first.

I will review with those points in mind.

@glemaitre
Contributor

What can the TargetEncoder from dirty cat not do/what is the advantage of making yet another implementation over taking it verbatim?

Since OneHotEncoder and OrdinalEncoder work for all problems, it could come as a surprise to our users to have to select the right encoder type. There is also the precedent of categorical_encoder, cuml, and dirty_cat (the ones that I am aware of), which expose a single encoder for all usages.

@betatim
Member

betatim commented Jan 9, 2023

Since OneHotEncoder and OrdinalEncoder work for all problems, it could come as a surprise to our users to have to select the right encoder type.

I agree. Having one encoder for all types of problems is nicer than having to choose.

My question was "Why not take the dirty_cat implementation of TargetEncoder (as a starting point) for the implementation in scikit-learn?"

@thomasjpfan
Member Author

thomasjpfan commented Jan 9, 2023

Here are the key differences between this PR and dirty_cat's version:

  1. This PR takes advantage of the existing code in _BaseEncoder, which greatly simplifies the implementation.
  2. This PR does cross validation in fit_transform while dirty_cat does not. The cross validation is required to prevent the target from leaking during training.
  3. This PR's implementation is much faster because of the Cython, even with cross validation in fit_transform:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import TargetRegressorEncoder
from dirty_cat import TargetEncoder as DirtyCatTargetEncoder
import numpy as np

rng = np.random.default_rng()
n_samples, n_features = 500_000, 20
X = rng.integers(0, high=30, size=(n_samples, n_features))
y = rng.standard_normal(size=n_samples)

%%timeit
_ = TargetRegressorEncoder().fit_transform(X, y)
# 3.37 s ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
_ = DirtyCatTargetEncoder().fit_transform(X, y)
# 9.9 s ± 220 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR is faster even though its fit_transform learns 6 encodings: five for cv=5 and one more over the whole dataset. dirty_cat only learns one encoding over the whole dataset because it does not do cross validation.

@thomasjpfan
Member Author

thomasjpfan commented Jan 10, 2023

What is your plan? Is it ready to go or does it need more tweaking?

It mostly remains to decide how we want to extend the API for classification targets. Currently, this PR is the minimum required for regression targets. The core computation in this PR can be extended to classification without too much trouble.

type_of_target is quite good and would do the job.

I do not like how type_of_target treats floats as classification targets. For example:

import numpy as np
from sklearn.utils.multiclass import type_of_target

type_of_target(np.asarray([1.0] * 10 + [2.0] * 30 + [4.0] * 10 + [5.0]))
# 'multiclass'

I prefer two more explicit options:

  1. TargetRegressorEncoder and TargetClassificationEncoder
  2. TargetEncoder with a target_type parameter to switch between regression and classification.

I went with option 1 in this PR, but I am okay with either option. For option 2, I am +0.5 on having a target_type="auto" option that will infer the type of the target.

@thomasjpfan thomasjpfan changed the title ENH Adds Target Regressor Encoder ENH Adds TargetEncoder Jan 10, 2023
@thomasjpfan
Member Author

thomasjpfan commented Jan 10, 2023

After thinking about it a little more, I am okay with just inferring the target type with type_of_target. I updated this PR:

  1. Infer the target type with type_of_target
  2. Added a target_type="auto" option to allow the user to control the inference.
  3. Support both binary classification and regression
  4. The encoder is now called TargetEncoder

@ogrisel
Member

ogrisel commented Mar 14, 2023

I pushed 26d2429 to add a statistical non-regression test that ensures that the nested cross-validation in fit_transform is actually useful to prevent overfitting in a minimal synthetic dataset that is hopefully not too artificial but reflects situations that can happen in practice. Note that the empirical Bayes shrinkage alone is not enough to prevent the pathological overfitting highlighted in this test case.

I am still not decided whether this would be better as a pitfall-style example or a test. I think having it in a test makes it less likely that a change to this encoder introduces a silent regression. But maybe the existing tests are enough.

I am curious to hear your feedback.

smooth : "auto" or float, default="auto"
The amount of mixing of the categorical encoding with the global target mean. A
larger `smooth` value will put more weight on the global target mean.
If `"auto"`, then `smooth` is set to an empirical Bayes estimate.
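As a rough illustration of the mixing described in this docstring (a simplified sketch for a float `smooth`, not the estimator's exact code):

```python
import numpy as np

def smoothed_encoding(y_cat, y_global_mean, smooth):
    # Shrink the per-category mean toward the global mean; a larger
    # `smooth` pulls harder toward the global mean (illustrative formula).
    n = len(y_cat)
    return (np.sum(y_cat) + smooth * y_global_mean) / (n + smooth)

y_cat = np.array([1.0, 1.0, 1.0])   # small category with local mean 1.0
print(smoothed_encoding(y_cat, y_global_mean=0.0, smooth=0.0))   # 1.0
print(smoothed_encoding(y_cat, y_global_mean=0.0, smooth=30.0))  # ~0.09
```

With `smooth=0` the encoding is the raw category mean; as `smooth` grows, small categories are pulled toward the global target mean.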
Member

I think we should allow an n_features-length parametrization here (e.g. a list and/or a dict with feature names as keys): the optimal smoothing for one feature is not necessarily optimal for others.

This can be done in a follow-up PR though. It feels borderline YAGNI to me. In the meantime, it is possible to use a ColumnTransformer to configure per-feature TargetEncoder instances with dedicated smooth values.

Member

@ogrisel ogrisel left a comment


Looks good to me. Based on test_target_encoding_for_linear_regression I think we need to do the nested cross-val by default and break the usual fit_transform <=> fit + transform implicit equivalence of other scikit-learn estimators. In particular, I don't see how to compute the "real" training accuracy: to do so we would need a fit_score method on pipelines (which could be a good idea by the way to save some redundant computation, but this is a digression).

Anyway, I don't see any other way around it, and to me the protection against catastrophic overfitting caused by noisy high-cardinality categorical features outweighs the potentially surprising (but well documented) behavior of fit_transform.

@thomasjpfan
Member Author

I am still not decided whether this would be better as a pitfall-style example or a test.

I think both are useful. I have not seen an example similar to your test case that demonstrates why the internal validation is useful.

In a follow up PR, we can convert the test into a pitfall style example and link it in the docstring for cv. (I'm still unsure about adding cv=None)

Member

@ogrisel ogrisel left a comment


Spotted another typo in the inline comment of the new test.

sklearn/preprocessing/tests/test_target_encoder.py (two resolved review threads)
@ogrisel
Member

ogrisel commented Mar 14, 2023

@lorentzenchr @betatim @glemaitre any more feedback?

@jovan-stojanovic you might be interested in the new test: I checked that dirty_cat's TargetEncoder can also lead to this kind of overfitting problem because of the lack of built-in CV during fit_transform.

@ogrisel
Member

ogrisel commented Mar 15, 2023

@thomasjpfan I had a devil-inspired idea at coffee: we could store a weakref to the training set at fit time to detect whether the X passed to .transform is the same as the one passed to .fit/.fit_transform. That would make it possible to hide the fit_transform / fit + transform discrepancy like cuML does, but without holding a strong reference to X_train, which would be problematic: it would prevent early collection of otherwise GC-able data and would make model serialization inefficient without additional care.

Still, the weakref hack could lead to surprising behaviors. For instance, while the following would work:

X_train = load_dataset_from_disk()
X_trans_1 = target_transformer.fit_transform(X_train)
X_trans_2 = target_transformer.transform(X_train)
np.testing.assert_allclose(X_trans_1, X_trans_2)

this seemingly innocuous variation would fail:

X_train = load_dataset_from_disk()
X_trans_1 = target_transformer.fit_transform(X_train)

X_train = load_dataset_from_disk()
X_trans_2 = target_transformer.transform(X_train)
np.testing.assert_allclose(X_trans_1, X_trans_2)

so overall, I am not 100% sure whether the weakref hack would be a usability improvement.

Feel free to pretend that you haven't read this comment and not reply. I would perfectly understand.

@ogrisel
Member

ogrisel commented Mar 15, 2023

Another pitfall I discovered when experimenting with this PR:

If you have a mix of informative and non-informative categorical features (e.g. f_i and f_u), and if you apply StandardScaler() after TargetEncoder(), then you can run into big trouble: the variance of target_encoder.fit_transform(X_train[[f_u]]) can be very small but non-zero, so StandardScaler can rescale it massively, which is probably a bad idea for a downstream distance-based model (e.g. a kernel machine, nearest neighbors and so on).

However, if you use the raw target-encoded values of f_i and f_u in conjunction with true numerical features that have all been scaled (via a dedicated entry in a ColumnTransformer), and if y has a much larger standard deviation, then f_i will have a very large standard deviation compared to the independently scaled numerical features, which is problematic.

I see two possible solutions:

  • a) add an option to TargetEncoder to internally use a scaled version of y_train to compute the encodings;
  • b) make it possible for StandardScaler or those scalers to use shared mean and scale statistics for a group of columns:
preprocessor = ColumnTransformer(
    [
        (
            "categorical",
            make_pipeline(TargetEncoder(), StandardScaler(shared_mean=True, shared_scale=True)),
            ["f_i", "f_u"],
        ),
    ],
    remainder=StandardScaler(),
)

Option b) is probably useful beyond post-processing target encoded categorical values. For instance @TomDLT and I encountered a similar problem a long time ago when running make_pipeline(StandardScaler(), LogisticRegression()) with the saga solver on MNIST: some pixel values are nearly constant 0 or 1, and per-feature scaling catastrophically hurts the condition number of X.T @ X if I recall correctly. At the time we just decided to use MinMaxScaler instead and move along. But this is a real problem when groups of features should be rescaled together to preserve meaning (e.g. pixel intensities, EEG channels, MEL features for audio signals and so on).

EDIT: we should probably do StandardScaler(shared_mean_and_scale=True) instead, because decoupling the two might be challenging from a code-maintenance point of view, and not that useful to end users.
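For what it's worth, the shared-scaling behavior proposed above could be prototyped with a small custom transformer (a hypothetical sketch; SharedStandardScaler is not part of scikit-learn):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class SharedStandardScaler(TransformerMixin, BaseEstimator):
    """Hypothetical sketch: one mean and one scale shared across all
    columns of the group, instead of per-feature statistics."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean()   # single scalar over all entries
        self.scale_ = X.std()   # single scalar over all entries
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) / self.scale_

X = np.array([[0.0, 10.0], [2.0, 12.0], [4.0, 14.0]])
X_scaled = SharedStandardScaler().fit_transform(X)
```

The relative magnitudes of the grouped columns are preserved, unlike with per-feature standardization.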

Even if we decide that a) would also be a convenient UX improvement, we can delay it to a follow-up PR.

I just wanted to brain-dump this here so that we can think about it when we work on a pitfall example for TargetEncoder-based feature engineering.

/cc @jovan-stojanovic who might also be interested for dirty_cat.

@thomasjpfan
Member Author

so overall, I am not 100% sure the weakref hack would be a usability improvement or not.

I think it borders on being too magical. For example, if the data is sliced the same way or copied, the references are not the same:

import numpy as np
import weakref

X = np.random.randn(10, 10)

X1 = X[:4]
X2 = X[:4]
X3 = X1.copy()

X1_ref = weakref.ref(X1)

assert X1 is X1_ref()
assert X2 is not X1_ref()
assert X3 is not X1_ref()

For reference, cuML's TargetEncoder holds the training data and checks all the values.

a) add an option to TargetEncoder to internally use a scaled version of y_train to compute the encodings;

At one point, I had something similar implemented in #17323 as the default. I think it's reasonable to use a scaled version of the target for encoding purposes.

@TomDLT
Member

TomDLT commented Mar 15, 2023

For instance @TomDLT and I encountered a similar problem a long time ago when running make_pipeline(StandardScaler(), LogisticRegression()) with the saga solver on MNIST: some pixel values are nearly constant 0 or 1 and having per-feature scaling is catastrophic.

Yes, some MNIST pixels reach a value of 250 after rescaling (see the snippet below), which is quite out of distribution for a unit-variance feature. We should add a warning when some outputs of StandardScaler are large (say above 100, or even 10).

This is a real problem when groups of features should be rescaled together to preserve meaning (e.g. pixel intensities, EEG channels, MEL features for audio signals and so on).

Yes, this is a big limitation of StandardScaler. +1 for adding StandardScaler(shared_mean=True, shared_scale=True).

(Figure: maximum value of each scaled MNIST feature; generated by the snippet below.)

import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler

mnist = fetch_openml("mnist_784", as_frame=False, parser="pandas")
X, y = mnist.data, mnist.target

X_scaled = StandardScaler().fit_transform(X)
max_values = X_scaled.max(axis=0)

fig, ax = plt.subplots()
image = ax.imshow(max_values.reshape(28, 28), cmap=plt.get_cmap("viridis", 6),
                  norm=LogNorm())
ax.set(xticks=[], yticks=[], title="Maximum value of each scaled MNIST feature")
fig.colorbar(image)
plt.show()

@ogrisel
Member

ogrisel commented Mar 16, 2023

a) add an option to TargetEncoder to internally use a scaled version of y_train to compute the encodings;

At one point, I had something similar implemented in #17323 as the default. I think it's reasonable to use a scaled version of the target for encoding purposes.

Let's keep that in mind for a follow-up PR. But if we want to make it the default (which would probably be helpful), we should do that before the 1.3 release.

@ogrisel
Member

ogrisel commented Mar 16, 2023

For reference, cuML's TargetEncoder holds the training data and checks all the values.

Note that we could use a weakref plus a concrete value check. But even that would feel too complex/magical. +0.5 for keeping the code as it is.

@glemaitre glemaitre self-requested a review March 16, 2023 12:53
Contributor

@glemaitre glemaitre left a comment


I have only one nitpick.

examples/preprocessing/plot_target_encoder.py (resolved review thread)
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@ogrisel ogrisel merged commit 392fdee into scikit-learn:main Mar 16, 2023
24 checks passed
@ogrisel
Member

ogrisel commented Mar 16, 2023

Merged! Thank you very much @thomasjpfan!

@lorentzenchr
Member

Just an idea: for detecting the training set during transform, we could store the mean, the variance, and maybe 2 quantiles. This combination should be pretty unique.

@ogrisel
Member

ogrisel commented Mar 17, 2023

The checks would need to happen before encoding, on the categorical variables. We could store the per-feature category counts instead, maybe with a few random probe records that contain several features with infrequent categories.
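The per-feature category-count fingerprint could look roughly like this (a hypothetical sketch, not part of the PR):

```python
from collections import Counter

def fingerprint(X_columns):
    # One Counter of category frequencies per feature column
    # (hypothetical helper for illustration).
    return [Counter(col) for col in X_columns]

X_train_cols = [["a", "b", "a", "c"], ["x", "x", "y", "z"]]
X_other_cols = [["a", "b", "b", "c"], ["x", "x", "y", "z"]]

fp_train = fingerprint(X_train_cols)
print(fingerprint(X_train_cols) == fp_train)   # True: likely the training set
print(fingerprint(X_other_cols) == fp_train)   # False: different data
```

Two different datasets can of course share the same counts, so this would only be a heuristic.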

@ogrisel
Member

ogrisel commented Mar 17, 2023

The checks would need to happen before encoding, on the categorical variables. We could store the per-feature category counts instead. Maybe with a few random probe records.

But this would be quite catastrophic in case of false positives.

@lorentzenchr
Member

Now I see the difficulty. Maybe it is good enough as is. In principle, we would need to detect every single row of the training set, and that's the responsibility of the user, isn't it?

@betatim
Member

betatim commented Mar 17, 2023

Whoop whoop! Nice work!

@BenjaminBossan
Contributor

BenjaminBossan commented Mar 20, 2023

Pretty nice addition, thanks for this.

A small question: according to the tags, TargetEncoder does not require y; should this be changed?

>>> from sklearn.preprocessing import TargetEncoder
>>> from sklearn.utils._tags import _safe_tags
>>> _safe_tags(TargetEncoder())['requires_y']
False

Development

Successfully merging this pull request may close these issues.

CountFeaturizer for categorical data