Pipeline: apply all transformations except the last classifier #8414
Comments
Thanks. I tend to do this as Pipeline(my_pipeline.steps[:-1]).transform(), but I can see how dedicating a method to this makes sense. One argument against the proposal is that I am trying to make the transformer interface richer (e.g. to get feature descriptions), and any new methods would be available if you create a new pipeline as in my approach, but would need further methods for yours. I've also considered adding a method to pull out any subsequence of steps as a new pipeline. One approach (#2568), although @GaelVaroquaux derided it as too magical, would allow my_pipeline[:-1].transform(). Would that suit your purposes?
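To make the idiom concrete, a minimal sketch (the pipeline contents and variable names are illustrative, and the slicing form is only the proposal from #2568, not existing behaviour):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)
my_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("clf", LogisticRegression()),
])
my_pipeline.fit(X, y)

# Current workaround: wrap every step except the final classifier in a new
# Pipeline. The step objects are already fitted, so transform() works directly.
preprocessing = Pipeline(my_pipeline.steps[:-1])
X_trans = preprocessing.transform(X)

# Proposed alternative (#2568): slice the pipeline itself.
# X_trans = my_pipeline[:-1].transform(X)
```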
|
my_pipeline[:-1].transform() would be great, yes. It would probably need examples because [:-1] is hard to google. Regarding feature description, currently I am using this code to get feature descriptions + ranking by tree ensembles: https://github.com/mratsim/MachineLearning_Kaggle/blob/master/Kaggle%20-%20001%20-%20Titanic%20Survivors/Kaggle-001-Python-MagicalForest.py#L526 (it walks through all transformations in a pipeline). The main issue is inconsistency between OneHotEncoder (active_features_) and LabelBinarizer (classes_), at least. Furthermore, for n-category features LabelBinarizer returns n columns, except if there are 2 (1 column). Edit: this should probably be discussed in another thread. Also, you probably referenced the wrong issue (2013). |
Well, you shouldn't really be using LabelBinarizer for features. But there are issues on GitHub for each of these things.
In terms of getting lists of important features, I think you might be interested in eli5 and the Pipeline support I've proposed there: TeamHG-Memex/eli5#158
|
Conceptually Fixes scikit-learn#8414 and related issues. Alternative to scikit-learn#2568, without __getitem__ and mixed semantics. Designed to assist in model inspection, and particularly to replicate the composite transformer represented by the steps of the pipeline with the exception of the last; i.e. pipe.get_subsequence(0, -1) is a common idiom. I feel like this becomes more necessary when considering more API-consistent clone behaviour as per scikit-learn#8350, since Pipeline(pipe.steps[:-1]) is no longer possible.
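For orientation only, a rough sketch of what such a helper could amount to (the function body here is an assumption, not the PR's actual implementation):

```python
from sklearn.pipeline import Pipeline

def get_subsequence(pipe, start, stop):
    """Hypothetical helper: return a new Pipeline built from pipe.steps[start:stop].

    The returned pipeline reuses the (possibly already fitted) step objects,
    so it can be used for transform() without refitting.
    """
    return Pipeline(pipe.steps[start:stop])

# The common idiom from the PR description: everything except the final classifier.
# preprocessing = get_subsequence(pipe, 0, -1)
# X_val_trans = preprocessing.transform(X_val)
```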
Entering the conversation a little late, sorry, but concerning the reference to … (by the way, the use of a test set in a stopping rule is questionable. Keras conflates the two, not me) |
You're right that I'm not sure this fixes the problem. I'm also not sure how to fit validation_data into our pipeline design. We would need a different pipeline definition, or some parameter that tells it how to specifically handle validation_data :/ |
It doesn't seem to fit at all, I have to agree. I've been thinking about a computation-graph-type abstraction where train and test follow different paths before converging into a classifier, but it's hard when there are side effects between the nodes and some nodes need to be treated differently by fit and transform (test can't be used for fitting). It seems like one just has to write custom fit and transform methods. Would love to be proven wrong. |
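As one possible shape of that "custom fit" workaround, a minimal sketch (the wrapper class, its name, and the eval_set/early_stopping_rounds keywords are assumptions for illustration; the exact early-stopping arguments vary between XGBoost/LightGBM versions):

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone


class EarlyStoppingClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical wrapper: transform both the training and the validation
    data with the same fitted preprocessing before fitting an early-stopping
    model such as XGBClassifier or LGBMClassifier."""

    def __init__(self, preprocessing, estimator, early_stopping_rounds=10):
        self.preprocessing = preprocessing
        self.estimator = estimator
        self.early_stopping_rounds = early_stopping_rounds

    def fit(self, X, y, X_val=None, y_val=None):
        self.preprocessing_ = clone(self.preprocessing).fit(X, y)
        Xt = self.preprocessing_.transform(X)
        fit_kwargs = {}
        if X_val is not None:
            # The crux of the issue: the validation set has to go through the
            # same fitted transformations as the training set.
            Xt_val = self.preprocessing_.transform(X_val)
            # Keyword names depend on the library/version (callbacks in newer
            # LightGBM, a constructor argument in newer XGBoost).
            fit_kwargs = {"eval_set": [(Xt_val, y_val)],
                          "early_stopping_rounds": self.early_stopping_rounds}
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(Xt, y, **fit_kwargs)
        return self

    def predict(self, X):
        return self.estimator_.predict(self.preprocessing_.transform(X))
```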
Pipeline should provide a method to apply its transformations to an arbitrary dataset, without transform from the last classifier step.
Use case:
Boosted tree models like XGBoost and LightGBM use a validation set for early stopping. We can trivially apply the pipeline to train and test via fit and predict, but not for the validation set.
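A minimal sketch of the problem (the dataset, step names and exact early-stopping keywords are assumptions; in older XGBoost versions early_stopping_rounds is a fit() argument rather than a constructor argument):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", xgb.XGBClassifier(early_stopping_rounds=10)),
])

# Train and test are easy: fit() and predict() run every step of the pipeline.
# The validation set is not: it has to be pushed through the same fitted
# transformations by hand before the booster can use it for early stopping.
preprocessing = Pipeline(pipe.steps[:-1]).fit(X_train, y_train)
X_val_trans = preprocessing.transform(X_val)

# clf__eval_set routes eval_set to the final step's fit(); the preprocessing
# steps are simply refit on X_train, so the extra fit above is redundant work.
pipe.fit(X_train, y_train, clf__eval_set=[(X_val_trans, y_val)])
pipe.predict(X_test)
```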
After raising the issue and proposing 2 ideas at LightGBM (microsoft/LightGBM#299) and XGBoost (dmlc/xgboost#2039), I believe it should be handled at the Scikit-learn level.
Idea 1: have a dummy transform method in XGBClassifier and LGBMClassifier
The transform method for pipeline/classifier is already extremely inconsistent. Furthermore, the issue will pop up again if the last classifier is an ensemble of multiple models.
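A minimal sketch of Idea 1, assuming a hypothetical subclass (not an actual XGBoost or LightGBM API): give the final classifier a no-op transform so that calling transform() on the whole pipeline returns the output of the preprocessing steps.

```python
import xgboost as xgb


class XGBClassifierWithTransform(xgb.XGBClassifier):
    """Hypothetical subclass illustrating the dummy-transform idea."""

    def transform(self, X):
        # Pass-through: the classifier adds nothing, so pipe.transform(X_val)
        # yields the validation data after all preceding transformations.
        return X
```

With this in place, pipe.transform(X_val) would return the preprocessed validation set needed for early stopping; the drawback, as noted above, is that every final estimator (including ensembles) would need the same dummy method.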
Idea 2: implement a validation_split parameter for early stopping
Early stopping in KerasClassifier is controlled by a validation_split parameter. At first I thought that could be used in XGBClassifier and LGBMClassifier, and everything else that would need a validation set for early stopping. The issue here is that there is no control over the validation set and split. Furthermore, if there is a need to inspect validation issues more deeply, I suppose it would be non-trivial to extract it from the classifier or provide an API for it.
Hence I think Scikit-learn needs a method, or a parameter in transform, to ignore the last step or the last n steps. If needed, I can raise a related issue on having a consistent transform method for classifiers and keep this one focused on applying transform without classification on arbitrary data.