
Pipeline: apply all transformations except the last classifier #8414

Open · mratsim opened this issue Feb 20, 2017 · 6 comments

mratsim commented Feb 20, 2017

Pipeline should provide a method to apply its transformations to an arbitrary dataset without invoking the final classifier step.

Use case:

Boosted tree models like XGBoost and LightGBM use a validation set for early stopping.
We can trivially apply the pipeline to the train and test sets via fit and predict, but not to the validation set.


After raising the issue and proposing 2 ideas at LightGBM (microsoft/LightGBM#299) and XGBoost (dmlc/xgboost#2039), I believe this should be handled at the Scikit-learn level.
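
For concreteness, a minimal sketch of the manual workaround this request would replace; all names (X_train, y_train, X_val, y_val) are placeholders, and depending on the xgboost version early_stopping_rounds may be a constructor rather than a fit parameter:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Fit the preprocessing on the training data only...
preprocessing = make_pipeline(StandardScaler())
X_train_t = preprocessing.fit_transform(X_train)
# ...then reuse the fitted preprocessing on the validation set monitored by early stopping.
X_val_t = preprocessing.transform(X_val)

clf = XGBClassifier()
clf.fit(X_train_t, y_train,
        eval_set=[(X_val_t, y_val)],
        early_stopping_rounds=10)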

Idea 1: add a dummy transform method to XGBClassifier and LGBMClassifier

The transform method for pipelines/classifiers is already extremely inconsistent:

  • Failure because the classifier step does not implement transform
  • Deprecated feature importance extraction for tree ensembles
  • NN features proposition for MLPClassifier #8291
  • Decision path proposition for tree ensembles #7907

Furthermore, the issue will pop up again if the last classifier is an ensemble of multiple models.

Idea 2: implement a validation_split parameter for early stopping

Early stopping in KerasClassifier is controlled by a validation_split parameter.
At first I thought that could be used in XGBClassifier, LGBMClassifier, and everything else that needs a validation set for early stopping.
The issue here is that there is no control over the validation set and split. Furthermore, if there is a need to inspect validation issues more deeply, I suppose it would be non-trivial to extract the validation set from the classifier or to provide an API for it.


Hence I think Scikit-learn needs a method, or a parameter on transform, to ignore the last step or the last n steps.

If needed I can raise a related issue on having a consistent transform method for classifiers and keep this one focused on applying transform without classification on arbitrary data.

jnothman commented Feb 20, 2017

mratsim commented Feb 21, 2017

my_pipeline[:-1].transform() would be great, yes. It would probably need examples because [:-1] is hard to google.
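
A hypothetical usage sketch of that slicing syntax (it did not exist at the time of writing; later scikit-learn versions added Pipeline slicing of this form), with pipe, X_train, y_train and X_val as placeholders:

pipe.fit(X_train, y_train)

# All fitted steps except the final classifier, reused as-is (no refitting).
preprocessing = pipe[:-1]
X_val_transformed = preprocessing.transform(X_val)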

Regarding feature description, currently I am using this code to get feature descriptions plus rankings from tree ensembles.
It walks through all transformations in a pipeline.

The main issue is the inconsistency between OneHotEncoder (active_features_) and LabelBinarizer (classes_), at least. Furthermore, for n-category features LabelBinarizer returns n columns, except when there are only 2 categories, in which case it returns a single column, which tripped me and others.
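
A short demonstration of that two-class quirk (output shapes as per scikit-learn's documented LabelBinarizer behaviour):

from sklearn.preprocessing import LabelBinarizer

print(LabelBinarizer().fit_transform(["a", "b", "c", "a"]).shape)  # (4, 3): one column per class
print(LabelBinarizer().fit_transform(["a", "b", "a"]).shape)       # (3, 1): a single 0/1 column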

Edit: this should probably be discussed in another thread. Also you probably referenced the wrong issue (2013)

jnothman commented Feb 21, 2017

jnothman added a commit to jnothman/scikit-learn that referenced this issue on Feb 22, 2017:
Conceptually Fixes scikit-learn#8414 and related issues. Alternative to scikit-learn#2568
without __getitem__ and mixed semantics.

Designed to assist in model inspection and particularly to replicate the
composite transformer represented by steps of the pipeline with the
exception of the last. I.e. pipe.get_subsequence(0, -1) is a common
idiom. I feel like this becomes more necessary when considering more
API-consistent clone behaviour as per scikit-learn#8350 as Pipeline(pipe.steps[:-1])
is no longer possible.
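
For readers unfamiliar with the idiom named in that commit message, a hypothetical usage sketch (get_subsequence never shipped in scikit-learn; pipe and X_val are placeholders):

# Replicate the composite transformer made of every step except the last.
preprocessing = pipe.get_subsequence(0, -1)
X_val_transformed = preprocessing.transform(X_val)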

piccolbo commented Mar 2, 2018

Entering the conversation a little late, sorry, but concerning the reference to KerasClassifier above: the data on which validation loss (and, optionally, early stopping) is computed is controlled by two parameters, validation_split as referenced above, or validation_data. The former implements a random subsample split; the latter, which takes priority, allows providing an arbitrary validation set. In Keras, these are fit arguments. Any data transformation such as normalization or imputation happens prior to a call to fit for a Keras model.

When specifying an sklearn pipeline, at pipeline creation time I am generally able to provide a value for validation_split, but to provide validation_data I need the ability to transform it. That is, I need to have fitted the first n-1 steps of the pipeline before the Pipeline is created! It's a catch-22. I don't understand how the proposal here (pipeline subsequence) addresses this problem.

# Code sketch, not real.
# X_train, X_test are what you think they are.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from keras.wrappers.scikit_learn import KerasClassifier

a_pipeline = make_pipeline(StandardScaler(), KerasClassifier(..., validation_split=.3, ...))
# Works.

scaler = StandardScaler().fit(X_train)
a_pipeline1 = make_pipeline(StandardScaler(), KerasClassifier(..., validation_data=scaler.transform(X_test), ...))
# Works, but I am repeating the preprocessing steps, fitting them "by hand", negating pipeline beauty.

a_pipeline2 = make_pipeline(StandardScaler(), KerasClassifier(..., validation_data=a_pipeline2[:-1].transform(X_test), ...))
# "a_pipeline2 undefined" error, and even if it were defined, it would need to be fitted.
# I was trying to solve this with pipeline subsequence.

preprocess = StandardScaler()  # or more steps
preprocess.fit(X_train)
a_pipeline3 = make_pipeline(preprocess, KerasClassifier(..., validation_data=preprocess.transform(X_test), ...))
# Now we can encapsulate all preprocessing in one sub-pipeline, but we are still calling fit by hand
# and could fit on the wrong set, etc.

(By the way, the use of a test set in a stopping rule is questionable. Keras conflates the two, not me.)

If I could have it my way, validation_data would be a function that splits the input set as it sees fit, and it would be used for all models, not just the ones that use early stopping or some such. I know there is score, but it's not the same: score lets the user provide X. If X is selected appropriately, we get a useful estimate; if not, not so much. I think evaluating a model should be a top-level concern, like fitting or predicting.
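
A rough, purely hypothetical sketch of that preference (neither scikit-learn nor the Keras wrapper accepts a callable here): pass a splitting function instead of pre-transformed data, so the split can be applied after the pipeline's transformations.

from sklearn.model_selection import train_test_split

def my_validation_splitter(X, y):
    # Any splitting strategy could go here; return the fit and validation halves.
    X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
    return (X_fit, y_fit), (X_val, y_val)

# Hypothetically: KerasClassifier(..., validation_data=my_validation_splitter, ...)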

jnothman commented Mar 3, 2018

You're right that I'm not sure this fixes the problem. I'm also not sure how to fit validation_data into our pipeline design. We would need a different pipeline definition, or some parameter that tells it how to specifically handle validation_data :/
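
A minimal sketch of that second option, assuming the final estimator accepts an eval_set fit parameter (as the XGBoost and LightGBM sklearn wrappers do); EvalSetPipeline and its fit signature are illustrative, not a scikit-learn API:

from sklearn.pipeline import Pipeline

class EvalSetPipeline(Pipeline):
    """Pipeline subclass that routes a raw validation set through the
    (fitted) preprocessing steps before handing it to the final estimator."""

    def fit(self, X, y=None, eval_set=None, **fit_params):
        if eval_set is None:
            return super().fit(X, y, **fit_params)
        # Fit and apply every step except the final estimator.
        Xt = X
        for _, step in self.steps[:-1]:
            Xt = step.fit_transform(Xt, y)
        # Transform each validation pair with the now-fitted steps.
        eval_t = [(self._apply_transforms(Xv), yv) for Xv, yv in eval_set]
        self.steps[-1][1].fit(Xt, y, eval_set=eval_t, **fit_params)
        return self

    def _apply_transforms(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return X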

piccolbo commented Mar 7, 2018

It doesn't seem to fit at all, I have to agree. I've been thinking about a computation-graph-type abstraction where train and test follow different paths before converging into a classifier, but it's hard when there are side effects between the nodes and some nodes need to be treated differently by fit and transform (the test set can't be used for fitting). It seems like one just has to write custom fit and transform methods. Would love to be proven wrong.
