
Pipeline: apply all transformations except the last classifier #8414

Open · mratsim opened this issue Feb 20, 2017 · 6 comments

mratsim commented Feb 20, 2017

Pipeline should provide a method to apply its transformations to an arbitrary dataset without invoking the final classifier step.

Use case:

Boosted tree models like XGBoost and LightGBM use a validation set for early stopping.
We can trivially apply the pipeline to the train and test sets via fit and predict, but not to the validation set.


After raising the issue and proposing 2 ideas at LightGBM (microsoft/LightGBM#299) and XGBoost (dmlc/xgboost#2039), I believe this should be handled at the Scikit-learn level.
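
For concreteness, a minimal sketch of the manual workaround this request would replace; all names (X_train, y_train, X_val, y_val) are placeholders, and depending on the xgboost version early_stopping_rounds may be a constructor rather than a fit parameter:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Fit the preprocessing on the training data only...
preprocessing = make_pipeline(StandardScaler())
X_train_t = preprocessing.fit_transform(X_train)
# ...then reuse the fitted preprocessing on the validation set monitored by early stopping.
X_val_t = preprocessing.transform(X_val)

clf = XGBClassifier()
clf.fit(X_train_t, y_train,
        eval_set=[(X_val_t, y_val)],
        early_stopping_rounds=10)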

Idea 1: add a dummy transform method to XGBClassifier and LGBMClassifier

The transform method for pipelines/classifiers is already extremely inconsistent:

  • Failure because the classifier step does not implement transform
  • Deprecated feature importance extraction for tree ensembles
  • NN features proposition for MLPClassifier #8291
  • Decision path proposition for tree ensembles #7907

Furthermore, the issue will pop up again if the last classifier is an ensemble of multiple models.

Idea 2: implement a validation_split parameter for early stopping

Early stopping in KerasClassifier is controlled by a validation_split parameter.
At first I thought that could be used in XGBClassifier, LGBMClassifier, and everything else that needs a validation set for early stopping.
The issue here is that there is no control over the validation set and split. Furthermore, if there is a need to inspect validation issues more deeply, I suppose it would be non-trivial to extract the validation set from the classifier or to provide an API for it.


Hence I think Scikit-learn needs a method, or a parameter on transform, to ignore the last step or the last n steps.

If needed I can raise a related issue on having a consistent transform method for classifiers and keep this one focused on applying transform without classification on arbitrary data.

jnothman commented Feb 20, 2017

mratsim commented Feb 21, 2017

my_pipeline[:-1].transform() would be great, yes. It would probably need examples because [:-1] is hard to google.
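
A hypothetical usage sketch of that slicing syntax (it did not exist at the time of writing; later scikit-learn versions added Pipeline slicing of this form), with pipe, X_train, y_train and X_val as placeholders:

pipe.fit(X_train, y_train)

# All fitted steps except the final classifier, reused as-is (no refitting).
preprocessing = pipe[:-1]
X_val_transformed = preprocessing.transform(X_val)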

Regarding feature description, currently I am using this code to get feature descriptions plus rankings from tree ensembles.
It walks through all transformations in a pipeline.

The main issue is the inconsistency between OneHotEncoder (active_features_) and LabelBinarizer (classes_), at least. Furthermore, for n-category features LabelBinarizer returns n columns, except when there are only 2 categories, in which case it returns a single column, which tripped me and others.
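
A short demonstration of that two-class quirk (output shapes as per scikit-learn's documented LabelBinarizer behaviour):

from sklearn.preprocessing import LabelBinarizer

print(LabelBinarizer().fit_transform(["a", "b", "c", "a"]).shape)  # (4, 3): one column per class
print(LabelBinarizer().fit_transform(["a", "b", "a"]).shape)       # (3, 1): a single 0/1 column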

Edit: this should probably be discussed in another thread. Also you probably referenced the wrong issue (2013)

jnothman commented Feb 21, 2017

jnothman added a commit to jnothman/scikit-learn that referenced this issue on Feb 22, 2017:
Conceptually Fixes scikit-learn#8414 and related issues. Alternative to scikit-learn#2568
without __getitem__ and mixed semantics.

Designed to assist in model inspection and particularly to replicate the
composite transformer represented by steps of the pipeline with the
exception of the last. I.e. pipe.get_subsequence(0, -1) is a common
idiom. I feel like this becomes more necessary when considering more
API-consistent clone behaviour as per scikit-learn#8350 as Pipeline(pipe.steps[:-1])
is no longer possible.
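
For readers unfamiliar with the idiom named in that commit message, a hypothetical usage sketch (get_subsequence never shipped in scikit-learn; pipe and X_val are placeholders):

# Replicate the composite transformer made of every step except the last.
preprocessing = pipe.get_subsequence(0, -1)
X_val_transformed = preprocessing.transform(X_val)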

piccolbo commented Mar 2, 2018

Entering the conversation a little late, sorry, but concerning the reference to KerasClassifier above: the data on which validation loss (and, optionally, early stopping) is computed is controlled by two parameters, validation_split as referenced above, or validation_data. The former implements a random subsample split; the latter, which takes priority, allows providing an arbitrary validation set. In Keras, these are fit arguments. Any data transformation such as normalization or imputation happens prior to a call to fit for a Keras model.

When specifying an sklearn pipeline, at pipeline creation time I am generally able to provide a value for validation_split, but to provide validation_data I need the ability to transform it. That is, I need to have fitted the first n-1 steps of the pipeline before the Pipeline is created! It's a catch-22. I don't understand how the proposal here (pipeline subsequence) addresses this problem.

# Code sketch, not real.
# X_train, X_test are what you think they are.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from keras.wrappers.scikit_learn import KerasClassifier

a_pipeline = make_pipeline(StandardScaler(), KerasClassifier(..., validation_split=.3, ...))
# Works.

scaler = StandardScaler().fit(X_train)
a_pipeline1 = make_pipeline(StandardScaler(), KerasClassifier(..., validation_data=scaler.transform(X_test), ...))
# Works, but I am repeating the preprocessing steps, fitting them "by hand", negating pipeline beauty.

a_pipeline2 = make_pipeline(StandardScaler(), KerasClassifier(..., validation_data=a_pipeline2[:-1].transform(X_test), ...))
# "a_pipeline2 undefined" error, and even if it were defined, it would need to be fitted.
# I was trying to solve this with pipeline subsequence.

preprocess = StandardScaler()  # or more steps
preprocess.fit(X_train)
a_pipeline3 = make_pipeline(preprocess, KerasClassifier(..., validation_data=preprocess.transform(X_test), ...))
# Now we can encapsulate all preprocessing in one sub-pipeline, but we are still calling fit by hand
# and could fit on the wrong set, etc.

(By the way, the use of a test set in a stopping rule is questionable. Keras conflates the two, not me.)

If I could have it my way, validation_data would be a function that splits the input set as it sees fit, and it would be used for all models, not just the ones that use early stopping or some such. I know there is score, but it's not the same: score lets the user provide X. If X is selected appropriately, we get a useful estimate; if not, not so much. I think evaluating a model should be a top-level concern, like fitting or predicting.
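
A rough, purely hypothetical sketch of that preference (neither scikit-learn nor the Keras wrapper accepts a callable here): pass a splitting function instead of pre-transformed data, so the split can be applied after the pipeline's transformations.

from sklearn.model_selection import train_test_split

def my_validation_splitter(X, y):
    # Any splitting strategy could go here; return the fit and validation halves.
    X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
    return (X_fit, y_fit), (X_val, y_val)

# Hypothetically: KerasClassifier(..., validation_data=my_validation_splitter, ...)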

jnothman commented Mar 3, 2018

You're right that I'm not sure this fixes the problem. I'm also not sure how to fit validation_data into our pipeline design. We would need a different pipeline definition, or some parameter that tells it how to specifically handle validation_data :/
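
A minimal sketch of that second option, assuming the final estimator accepts an eval_set fit parameter (as the XGBoost and LightGBM sklearn wrappers do); EvalSetPipeline and its fit signature are illustrative, not a scikit-learn API:

from sklearn.pipeline import Pipeline

class EvalSetPipeline(Pipeline):
    """Pipeline subclass that routes a raw validation set through the
    (fitted) preprocessing steps before handing it to the final estimator."""

    def fit(self, X, y=None, eval_set=None, **fit_params):
        if eval_set is None:
            return super().fit(X, y, **fit_params)
        # Fit and apply every step except the final estimator.
        Xt = X
        for _, step in self.steps[:-1]:
            Xt = step.fit_transform(Xt, y)
        # Transform each validation pair with the now-fitted steps.
        eval_t = [(self._apply_transforms(Xv), yv) for Xv, yv in eval_set]
        self.steps[-1][1].fit(Xt, y, eval_set=eval_t, **fit_params)
        return self

    def _apply_transforms(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return X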

piccolbo commented Mar 7, 2018

It doesn't seem to fit at all, I have to agree. I've been thinking about a computation-graph-type abstraction where train and test follow different paths before converging into a classifier, but it's hard when there are side effects between the nodes and some nodes need to be treated differently by fit and transform (the test set can't be used for fitting). It seems like one just has to write custom fit and transform methods. Would love to be proven wrong.
