
NOCATS: Categorical splits for tree-based learners (ctnd.) #12866

Open
Wants to merge 59 commits into base: main

Conversation


@adrinjalali adrinjalali commented Dec 26, 2018

This PR continues the work of #4899. For now I've merged master into the PR, made it compile, and got the tests running. There are several issues that need to be fixed; the list will be updated as I encounter them. Also, not all of these items are necessarily open: I have only collected them from the comments on the original PR, and still need to check whether each is already addressed or needs to be addressed.

  • merge master into the PR (done)
  • sparse tests pass (done)
    • The code is supposed to be the same as the status quo implementation if categories are not passed. But right now the tests related to sparse data fail.
    • EDIT: The tests pass if we compare floats with almost_equal
  • LabelEncoder -> CategoricalEncoder (done)
    • Preprocessing is not a part of NOCATS anymore.
  • Is maximum random generations 20 or 40 (done)
    • It's actually 60
  • Don't quantize features automatically (done)
  • check the category count limits for given data. (done)
  • add a benchmark
  • add tests (right now only invalid inputs are tested)
    • tree/tests done
    • ensemble/tests done
  • benchmark against master
  • add an example with plots
  • check numpy upgrade related issues (we've upgraded our numpy requirement in the meantime)
  • run some benchmarks with a simple integer coding of the features (with arbitrary ordering)
  • add cat_split to NODE_DTYPE once joblib.hash can handle it (padded struct)

Closes #4899

Future work: these are the possible follow-ups we already know of (i.e. outside the scope of this PR):

  • Heuristic methods to allow fast Breiman-like training for multi-class classification
  • export to graphviz
  • One-hot emulation using the NOCATS machinery
  • support sparse input
  • handle categories as their unique values instead of [0, max(feature)]
    • This is to be consistent with our encoders' behavior
    • moved this to future work per #12866 (comment)

P.S. I moved away from the "task list" format due to its extremely buggy interface when combined with editing the post, which I'm doing extensively to keep the status easy to follow.

jblackburne and others added 20 commits Feb 11, 2017
…causing all kinds of problems. Now safe_realloc requires the item size to be explicitly provided. Also, it can allocate arrays of pointers to any type by casting to void*.
…to categorical variables. Replaced the threshold attribute of SplitRecord and Node with SplitValue.
…hat defaults to -1 for each feature (indicating non-categorical).
…ediction with trees. Also introduced category caches for quick evaluation of categorical splits.

@jnothman jnothman commented Dec 26, 2018

Wow. Good on you for taking this on!


@adrinjalali adrinjalali commented Dec 26, 2018

~~I assume the AppVeyor failure is unrelated to this PR.~~


@NicolasHug NicolasHug commented Nov 26, 2019

Also I'm super late to the party, but what is the benefit of NOCATs over One-Hot-Encoding the categories?
As far as I understand the strategy proposed here is equivalent to re-implementing the OHE within the tree logic. So what are the main benefits of NOCATs over OHE, apart from using less memory?


@h-vetinari h-vetinari commented Nov 26, 2019

Also I'm super late to the party, but what is the benefit of NOCATs over One-Hot-Encoding the categories?

One-hot encoding only allows you to split off 1-vs-the-rest, whereas the optimal split for a categorical variable may be many-vs-many. For example, the optimal split at a given node may be:

{A, B, C, D, E, F, G} --> {B, C, F} vs. {A, D, E, G}

but one-hot encoding would only be able to yield one of

{A, B, C, D, E, F, G} --> {A} vs. {B, C, D, E, F, G}
{A, B, C, D, E, F, G} --> {B} vs. {A, C, D, E, F, G}
{A, B, C, D, E, F, G} --> {C} vs. {A, B, D, E, F, G}
{A, B, C, D, E, F, G} --> {D} vs. {A, B, C, E, F, G}
{A, B, C, D, E, F, G} --> {E} vs. {A, B, C, D, F, G}
{A, B, C, D, E, F, G} --> {F} vs. {A, B, C, D, E, G}
{A, B, C, D, E, F, G} --> {G} vs. {A, B, C, D, E, F}

This obviously affects the depth / number of splits that are necessary to get a similarly good result.
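
To make the contrast concrete, here is a minimal, self-contained sketch (toy data, not code from this PR) that compares the impurity of the best one-vs-rest split, which is all a one-hot encoded column can express at a single node, against a many-vs-many split on a target that depends on the subset {B, C, F}:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(y, mask):
    """Weighted Gini impurity of the two children produced by a boolean mask."""
    n = len(y)
    return (mask.sum() / n) * gini(y[mask]) + ((~mask).sum() / n) * gini(y[~mask])

rng = np.random.default_rng(0)
categories = np.array(list("ABCDEFG"))
x = rng.integers(0, len(categories), size=1000)           # categorical feature coded 0..6
y = np.isin(categories[x], ["B", "C", "F"]).astype(int)   # target depends on a subset

# Best split a one-hot encoded column can express at one node: single category vs. rest.
best_one_vs_rest = min(split_impurity(y, x == c) for c in range(len(categories)))

# The many-vs-many split a categorical-aware (NOCATS-style) tree could take directly.
many_vs_many = split_impurity(y, np.isin(categories[x], ["B", "C", "F"]))

print(f"best one-vs-rest impurity: {best_one_vs_rest:.3f}")  # well above zero
print(f"many-vs-many impurity:     {many_vs_many:.3f}")      # 0.000, classes separated
```

The many-vs-many split separates the two classes in a single node, while every one-vs-rest split leaves most of the impurity in place, which is exactly the depth/number-of-splits effect mentioned above.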


@adrinjalali adrinjalali commented Nov 26, 2019

@NicolasHug this is only one benchmark, but at least on this dataset, there are benefits to using NOCATS: #12866 (comment)


@amueller amueller commented Dec 2, 2019

@NicolasHug

What we plan to implement in the Histogram-GBDTs is what the paper does, except that we sort categories at each split instead of at the beginning (the paper also discusses that)

That is the exact solution for regression and some binary cases.
This PR mentions in the beginning "Heuristic methods to allow fast Breiman-like training for multi-class classification" which is basically what you're implementing.

Though I imagine you're doing this based on the unnormalized probabilities, which is one more level of indirection compared with the trees.


@amueller amueller commented Dec 2, 2019

Actually, because you're fitting a regression tree each time, the sorting may always be exact, depending on the loss. I need to think about that again and look at the formula.
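
For context, here is a rough sketch of the sort-by-mean-target trick being discussed (assuming an MSE/variance criterion; the helper name is made up, and this is not the Cython code from this PR or from the histogram GBDTs). Sorting the k categories by their mean target and scanning only the ordered cut points evaluates k - 1 candidate partitions instead of 2^(k-1) - 1, and for regression and binary targets this still recovers the optimal binary partition:

```python
import numpy as np

def best_categorical_split(x_cat, y):
    """Best binary partition of categories under an MSE criterion, found with the
    Fisher/Breiman trick: sort categories by mean target, then only consider
    splits of that ordering."""
    cats = np.unique(x_cat)
    means = np.array([y[x_cat == c].mean() for c in cats])
    order = cats[np.argsort(means)]                 # categories sorted by mean target

    best_score, best_left = np.inf, None
    for k in range(1, len(order)):                  # only k - 1 ordered cut points
        left = order[:k]
        mask = np.isin(x_cat, left)
        # total within-child squared error (lower is better)
        score = ((y[mask] - y[mask].mean()) ** 2).sum() + \
                ((y[~mask] - y[~mask].mean()) ** 2).sum()
        if score < best_score:
            best_score, best_left = score, set(left.tolist())
    return best_left, best_score

# toy usage: categories 1 and 3 have a high mean target, the others a low one
rng = np.random.default_rng(0)
x = rng.integers(0, 5, size=200)
y = np.where(np.isin(x, [1, 3]), 2.0, -1.0) + rng.normal(scale=0.1, size=200)
print(best_categorical_split(x, y))   # best partition groups {0, 2, 4} vs {1, 3}
```

For multi-class targets no single ordering is guaranteed to be optimal, which is why the PR description lists Breiman-like heuristics for multi-class as future work.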


@h-vetinari h-vetinari commented Mar 9, 2020

@adrinjalali
Is there a chance you'll pick this up again in the foreseeable future? It's still a sore need in many situations, and it forces those affected to use packages other than scikit-learn.


@adrinjalali adrinjalali commented Mar 9, 2020

We're probably going to implement this (or a version of it) for the new HistGradientBoosting estimators instead. Realistically, we should have it in 6 months or so.


@h-vetinari h-vetinari commented Mar 9, 2020

Great to hear, thanks!


@dlong11 dlong11 commented Mar 26, 2020

@adrinjalali Any idea of an ETA on this? Just planning a few projects and this feature would be great to have!


@adrinjalali adrinjalali commented Mar 26, 2020

We're planning to have a version of this, probably for HistGradientBoosting*, in the October/November release.


@h-vetinari h-vetinari commented Sep 15, 2020

As someone who regularly comes back to this PR to check the status of NOCATS, I looked around to see what's currently happening - maybe others who have "only" subscribed to this thread are interested too:

Following the "Categorical" project that this PR is a part of, I found #15550, which fit well with the quote above:

@adrinjalali: we're planning to have a version of this, probably for HistGradientBoosting* in the October/November release.

This was being worked on in #16909, but - unsurprisingly, considering the current state of the world - it was delayed and is at risk of not making it into 0.24. The last proposal from @NicolasHug was to continue with a pared-down (non-C++) version for 0.24, which is currently being worked on in #18394.

In any case, very happy to see this moving forward! 🚀


@RiverShah RiverShah commented Dec 22, 2020

+1 to see this in a release. CatBoost and other tree-based libraries have good categorical handling, but I'd like to see sklearn handle this out of the box.


@NicolasHug NicolasHug commented Dec 23, 2020

For everyone interested: version 0.24 was just released with categorical support in the HistGradientBoosting estimators: https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_0_24_0.html#native-support-for-categorical-features-in-histgradientboosting-estimators
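
For anyone landing here later, a minimal usage sketch of that feature (made-up data; the key piece is the categorical_features parameter, and the categorical column must already be encoded as non-negative integers, e.g. with OrdinalEncoder):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 5, size=300),   # ordinally encoded categorical feature
    rng.normal(size=300),           # numerical feature
])
y = np.isin(X[:, 0], [1, 3]).astype(int)

# Column 0 is treated natively as categorical (many-vs-many splits, no one-hot blow-up).
clf = HistGradientBoostingClassifier(categorical_features=[0], random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```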

Base automatically changed from master to main Jan 22, 2021

@nehargupta nehargupta commented Jan 24, 2021

Hello,
I just wanted to check in and see whether a categorical implementation of decision trees might still happen in a future iteration. My team has been checking on this periodically, as we are hoping to see it sometime. We developed an open-source package that uses feature selection in an intermediate step of our algorithm, so relying on the one-hot encoder is not ideal. I might be able to assist in some way, if needed, although I would be a new contributor to scikit-learn. Thanks :)


@SinaDBMS SinaDBMS commented May 29, 2021

@NicolasHug

Also I'm super late to the party, but what is the benefit of NOCATs over One-Hot-Encoding the categories?
As far as I understand the strategy proposed here is equivalent to re-implementing the OHE within the tree logic. So what are the main benefits of NOCATs over OHE, apart from using less memory?

Another drawback of one-hot encoding appears when the categorical feature to be encoded has many possible values: it results in a large set of one-hot features. So if a tree randomly picks a subset of the features for splitting, these one-hot-encoded features are more likely to be picked than the original features.
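
A back-of-the-envelope illustration of that bias, with made-up numbers: after one-hot encoding, most of the randomly sampled candidate columns come from that single high-cardinality feature.

```python
# Toy numbers: 9 numeric features plus one categorical feature with 100 levels.
n_numeric, n_categories = 9, 100

# Chance that a randomly drawn candidate column comes from the categorical feature:
p_before_ohe = 1 / (n_numeric + 1)                        # 0.10
p_after_ohe = n_categories / (n_numeric + n_categories)   # ~0.92

print(f"before OHE: {p_before_ohe:.2f}, after OHE: {p_after_ohe:.2f}")
```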


@Jing25 Jing25 commented Nov 18, 2021

Hi, I'm wondering whether random forests support categorical data yet?


@AndreaTrucchia AndreaTrucchia commented Feb 22, 2022

Hello, I would like to inquire about the status of this branch. My team would really benefit from it, and it would free us from resorting to R every now and then.


@adrinjalali adrinjalali commented Feb 24, 2022

@AndreaTrucchia have you checked HistGradientBoosting* instead?


@AndreaTrucchia AndreaTrucchia commented Feb 24, 2022

@adrinjalali I am checking it; too bad most of my work concerns Random Forests. However, I think I can give it a try for studies that revolve just around the effect of different categories on the predicted label. Thanks a lot.


@adrinjalali adrinjalali commented Feb 24, 2022

Out of curiosity, do the preprocessing techniques we have to handle categorical variables not satisfy your needs in a Pipeline?


@AndreaTrucchia AndreaTrucchia commented Feb 24, 2022

Dear @adrinjalali, in a scikit-learn environment I tend to one-hot-encode the categorical variables, with very good performance (see e.g. https://www.mdpi.com/2571-6255/5/1/30). However, with the R (randomForest package) style of treating categorical variables, I can use the partialPlot function, which can rank the categories from, say, 1 ("this category enhances the classification as label A") to -1 ("this category strongly disagrees with classification as label A").
I hope I was clear enough :)


@adrinjalali adrinjalali commented Feb 24, 2022

Isn't partialPlot the partial dependence plots that we have? (https://scikit-learn.org/stable/modules/partial_dependence.html#partial-dependence)

You could pass a pipeline with the OneHotEncoder included in it and get the partial dependence (I think).
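
A rough sketch of what that could look like (toy data and a made-up column layout; the OneHotEncoder sits inside the pipeline via a ColumnTransformer, so the partial dependence is computed with respect to the original integer-coded categorical column):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 4, size=300),   # integer-coded categorical column
    rng.normal(size=300),           # numeric column
])
y = (X[:, 0] == 2).astype(int)

# One-hot encode column 0 inside the pipeline, pass the numeric column through.
preprocess = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown="ignore"), [0])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, RandomForestClassifier(random_state=0)).fit(X, y)

# Average predicted probability for each value of the *original* categorical column.
print(partial_dependence(model, X, features=[0], kind="average"))
```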


@NicolasHug NicolasHug commented Feb 24, 2022

That would probably be a different thing. Our PDP support is only defined for regressors, not classifiers. The "partial dependence" as we support it is defined as the expectation of a continuous target.
EDIT: nvm, I'm wrong; it can rely on the decision_function.

Projects: Categorical (To do)