
NOCATS: Categorical splits for tree-based learners (ctnd.) #12866

Open
Wants to merge 59 commits into base: main

Conversation


@adrinjalali adrinjalali commented Dec 26, 2018

This PR continues the work of #4899. For now I've merged master into the PR, made it compile, and got the tests running. There are several issues that need to be fixed; the list will be updated as I encounter them. Also, not all of these items are necessarily open: I have only collected them from the comments on the original PR, and still need to check whether each is already addressed or needs to be addressed.

  • merge master into the PR (done)
  • sparse tests pass (done)
    • The code is supposed to be the same as the status quo implementation if categories are not passed. But right now the tests related to sparse data fail.
    • EDIT: The tests pass if we compare floats with almost_equal
  • LabelEncoder -> CategoricalEncoder (done)
    • Preprocessing is not a part of NOCATS anymore.
  • Is maximum random generations 20 or 40 (done)
    • It's actually 60
  • Don't quantize features automatically (done)
  • check the category count limits for given data. (done)
  • add a benchmark
  • add tests (right now only invalid inputs are tested)
    • tree/tests done
    • ensemble/tests done
  • benchmark against master
  • add an example with plots
  • check numpy upgrade related issues (we've upgraded our numpy requirement in the meantime)
  • run some benchmarks with a simple integer coding of the features (with arbitrary ordering)
  • add cat_split to NODE_DTYPE once joblib.hash can handle it (padded struct)

Closes #4899

Future work: these are the possible follow-ups we already know of (i.e. outside the scope of this PR):

  • Heuristic methods to allow fast Breiman-like training for multi-class classification
  • export to graphviz
  • One-hot emulation using the NOCATS machinery
  • support sparse input
  • handle categories as their unique values instead of [0, max(feature)]
    • This is to be consistent with our encoders' behavior
    • moved this to future work per #12866 (comment)

P.S. I moved away from the "task list" format due to its extremely buggy interface when combined with editing the post, which I'm doing extensively to keep the status easy to follow.

jblackburne and others added 20 commits Feb 11, 2017
…causing all kinds of problems. Now safe_realloc requires the item size to be explicitly provided. Also, it can allocate arrays of pointers to any type by casting to void*.
…to categorical variables. Replaced the threshold attribute of SplitRecord and Node with SplitValue.
…hat defaults to -1 for each feature (indicating non-categorical).
…ediction with trees. Also introduced category caches for quick evaluation of categorical splits.

@jnothman jnothman commented Dec 26, 2018

Wow. Good on you for taking this on!


@adrinjalali adrinjalali commented Dec 26, 2018

~~I assume the AppVeyor failure is unrelated to this PR.~~


@NicolasHug NicolasHug commented Nov 26, 2019

Also I'm super late to the party, but what is the benefit of NOCATs over One-Hot-Encoding the categories?
As far as I understand the strategy proposed here is equivalent to re-implementing the OHE within the tree logic. So what are the main benefits of NOCATs over OHE, apart from using less memory?


@h-vetinari h-vetinari commented Nov 26, 2019

Also I'm super late to the party, but what is the benefit of NOCATs over One-Hot-Encoding the categories?

One-hot encoding only allows you to split off 1-vs-the-rest, whereas the optimal split for a categorical variable may be many-vs-many. For example, the optimal split at a given node may be:

{A, B, C, D, E, F, G} --> {B, C, F} vs. {A, D, E, G}

but one-hot encoding would only be able to yield one of

{A, B, C, D, E, F, G} --> {A} vs. {B, C, D, E, F, G}
{A, B, C, D, E, F, G} --> {B} vs. {A, C, D, E, F, G}
{A, B, C, D, E, F, G} --> {C} vs. {A, B, D, E, F, G}
{A, B, C, D, E, F, G} --> {D} vs. {A, B, C, E, F, G}
{A, B, C, D, E, F, G} --> {E} vs. {A, B, C, D, F, G}
{A, B, C, D, E, F, G} --> {F} vs. {A, B, C, D, E, G}
{A, B, C, D, E, F, G} --> {G} vs. {A, B, C, D, E, F}

This obviously affects the depth / number of splits that are necessary to get a similarly good result.
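
To make the contrast concrete, here is a minimal, self-contained sketch (toy data, not code from this PR) that compares the impurity of the best one-vs-rest split, which is all a one-hot encoded column can express at a single node, against a many-vs-many split on a target that depends on the subset {B, C, F}:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(y, mask):
    """Weighted Gini impurity of the two children produced by a boolean mask."""
    n = len(y)
    return (mask.sum() / n) * gini(y[mask]) + ((~mask).sum() / n) * gini(y[~mask])

rng = np.random.default_rng(0)
categories = np.array(list("ABCDEFG"))
x = rng.integers(0, len(categories), size=1000)           # categorical feature coded 0..6
y = np.isin(categories[x], ["B", "C", "F"]).astype(int)   # target depends on a subset

# Best split a one-hot encoded column can express at one node: single category vs. rest.
best_one_vs_rest = min(split_impurity(y, x == c) for c in range(len(categories)))

# The many-vs-many split a categorical-aware (NOCATS-style) tree could take directly.
many_vs_many = split_impurity(y, np.isin(categories[x], ["B", "C", "F"]))

print(f"best one-vs-rest impurity: {best_one_vs_rest:.3f}")  # well above zero
print(f"many-vs-many impurity:     {many_vs_many:.3f}")      # 0.000, classes separated
```

The many-vs-many split separates the two classes in a single node, while every one-vs-rest split leaves most of the impurity in place, which is exactly the depth/number-of-splits effect mentioned above.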


@adrinjalali adrinjalali commented Nov 26, 2019

@NicolasHug this is only one benchmark, but at least on this dataset, there are benefits to using NOCATS: #12866 (comment)


@amueller amueller commented Dec 2, 2019

@NicolasHug

What we plan to implement in the Histogram-GBDTs is what the paper does, except that we sort categories at each split instead of at the beginning (the paper also discusses that)

That is the exact solution for regression and some binary cases.
This PR mentions in the beginning "Heuristic methods to allow fast Breiman-like training for multi-class classification" which is basically what you're implementing.

Though I imagine you're doing this based on the unnormalized probabilities, which is one more level of indirection compared with the trees.


@amueller amueller commented Dec 2, 2019

Actually, because you're fitting a regression tree each time, the sorting may always be exact, depending on the loss. I need to think about that again and look at the formula.
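
For context, here is a rough sketch of the sort-by-mean-target trick being discussed (assuming an MSE/variance criterion; the helper name is made up, and this is not the Cython code from this PR or from the histogram GBDTs). Sorting the k categories by their mean target and scanning only the ordered cut points evaluates k - 1 candidate partitions instead of 2^(k-1) - 1, and for regression and binary targets this still recovers the optimal binary partition:

```python
import numpy as np

def best_categorical_split(x_cat, y):
    """Best binary partition of categories under an MSE criterion, found with the
    Fisher/Breiman trick: sort categories by mean target, then only consider
    splits of that ordering."""
    cats = np.unique(x_cat)
    means = np.array([y[x_cat == c].mean() for c in cats])
    order = cats[np.argsort(means)]                 # categories sorted by mean target

    best_score, best_left = np.inf, None
    for k in range(1, len(order)):                  # only k - 1 ordered cut points
        left = order[:k]
        mask = np.isin(x_cat, left)
        # total within-child squared error (lower is better)
        score = ((y[mask] - y[mask].mean()) ** 2).sum() + \
                ((y[~mask] - y[~mask].mean()) ** 2).sum()
        if score < best_score:
            best_score, best_left = score, set(left.tolist())
    return best_left, best_score

# toy usage: categories 1 and 3 have a high mean target, the others a low one
rng = np.random.default_rng(0)
x = rng.integers(0, 5, size=200)
y = np.where(np.isin(x, [1, 3]), 2.0, -1.0) + rng.normal(scale=0.1, size=200)
print(best_categorical_split(x, y))   # best partition groups {0, 2, 4} vs {1, 3}
```

For multi-class targets no single ordering is guaranteed to be optimal, which is why the PR description lists Breiman-like heuristics for multi-class as future work.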


@h-vetinari h-vetinari commented Mar 9, 2020

@adrinjalali
Is there a chance you'll pick this up again in the foreseeable future? It's still a sore need in many situations, and it forces those affected to use packages other than scikit-learn.


@adrinjalali adrinjalali commented Mar 9, 2020

We're probably going to implement this (or a version of it) for the new HistGradientBoosting estimators instead. Realistically, we should have it in 6 months or so.


@h-vetinari h-vetinari commented Mar 9, 2020

Great to hear, thanks!


@dlong11 dlong11 commented Mar 26, 2020

@adrinjalali Any idea of an ETA on this? Just planning a few projects and this feature would be great to have!


@adrinjalali adrinjalali commented Mar 26, 2020

We're planning to have a version of this, probably for HistGradientBoosting*, in the October/November release.


@h-vetinari h-vetinari commented Sep 15, 2020

As someone who regularly comes back to this PR to check the status of NOCATS, I looked around to see what's currently happening - maybe others who have "only" subscribed to this thread are interested too:

Following the "Categorical" project that this PR is a part of, I found #15550, which fit well with the quote above:

@adrinjalali: we're planning to have a version of this, probably for HistGradientBoosting* in the October/November release.

This was being worked on in #16909, but - unsurprisingly, considering the current state of the world - it was delayed and is at risk of not making it into 0.24. The last proposal from @NicolasHug was to continue with a pared-down (non-C++) version for 0.24, which is currently being worked on in #18394.

In any case, very happy to see this moving forward! 🚀


@RiverShah RiverShah commented Dec 22, 2020

+1 to see this in a release. CatBoost and other tree-based libraries have good categorical handling, but I'd like to see sklearn handle this out of the box.


@NicolasHug NicolasHug commented Dec 23, 2020

For everyone interested: version 0.24 was just released with categorical support in the HistGradientBoosting estimators: https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_0_24_0.html#native-support-for-categorical-features-in-histgradientboosting-estimators
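
For anyone landing here later, a minimal usage sketch of that feature (made-up data; the key piece is the categorical_features parameter, and the categorical column must already be encoded as non-negative integers, e.g. with OrdinalEncoder):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 5, size=300),   # ordinally encoded categorical feature
    rng.normal(size=300),           # numerical feature
])
y = np.isin(X[:, 0], [1, 3]).astype(int)

# Column 0 is treated natively as categorical (many-vs-many splits, no one-hot blow-up).
clf = HistGradientBoostingClassifier(categorical_features=[0], random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```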

Base automatically changed from master to main Jan 22, 2021

@nehargupta nehargupta commented Jan 24, 2021

Hello,
I just wanted to check in and see whether a categorical implementation of decision trees might still happen in a future iteration. My team has been checking on this periodically, as we are hoping to see it sometime. We developed an open-source package that uses feature selection in an intermediate step of our algorithm, so relying on the one-hot encoder is not ideal. I might be able to assist in some way, if needed, although I would be a new contributor to scikit-learn. Thanks :)


@SinaDBMS SinaDBMS commented May 29, 2021

@NicolasHug

Also I'm super late to the party, but what is the benefit of NOCATs over One-Hot-Encoding the categories?
As far as I understand the strategy proposed here is equivalent to re-implementing the OHE within the tree logic. So what are the main benefits of NOCATs over OHE, apart from using less memory?

Another drawback of one-hot encoding appears when the categorical feature to be encoded has many possible values: it results in a large set of one-hot features. So if a tree randomly picks a subset of the features for splitting, these one-hot-encoded features are more likely to be picked than the original features.
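
A back-of-the-envelope illustration of that bias, with made-up numbers: after one-hot encoding, most of the randomly sampled candidate columns come from that single high-cardinality feature.

```python
# Toy numbers: 9 numeric features plus one categorical feature with 100 levels.
n_numeric, n_categories = 9, 100

# Chance that a randomly drawn candidate column comes from the categorical feature:
p_before_ohe = 1 / (n_numeric + 1)                        # 0.10
p_after_ohe = n_categories / (n_numeric + n_categories)   # ~0.92

print(f"before OHE: {p_before_ohe:.2f}, after OHE: {p_after_ohe:.2f}")
```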


@Jing25 Jing25 commented Nov 18, 2021

Hi, I'm wondering whether random forests support categorical data yet?


@AndreaTrucchia AndreaTrucchia commented Feb 22, 2022

Hello, I would like to inquire about the status of this branch. My team would really benefit from it, and it would free us from resorting to R every now and then.


@adrinjalali adrinjalali commented Feb 24, 2022

@AndreaTrucchia have you checked HistGradientBoosting* instead?


@AndreaTrucchia AndreaTrucchia commented Feb 24, 2022

@adrinjalali I am checking it; too bad most of my work concerns Random Forests. However, I think I can give it a try for studies that revolve just around the effect of different categories on the predicted label. Thanks a lot.


@adrinjalali adrinjalali commented Feb 24, 2022

Out of curiosity, do the preprocessing techniques we have to handle categorical variables not satisfy your needs in a Pipeline?


@AndreaTrucchia AndreaTrucchia commented Feb 24, 2022

Dear @adrinjalali, in a scikit-learn environment I tend to one-hot-encode the categorical variables, with very good performance (see e.g. https://www.mdpi.com/2571-6255/5/1/30). However, with the R (randomForest package) style of treating categorical variables, I can use the partialPlot function, which can rank the categories from, say, 1 ("this category enhances the classification as label A") to -1 ("this category strongly disagrees with classification as label A").
I hope I was clear enough :)


@adrinjalali adrinjalali commented Feb 24, 2022

Isn't partialPlot the partial dependence plots that we have? (https://scikit-learn.org/stable/modules/partial_dependence.html#partial-dependence)

You could pass a pipeline with the OneHotEncoder included in it and get the partial dependence (I think).
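
A rough sketch of what that could look like (toy data and a made-up column layout; the OneHotEncoder sits inside the pipeline via a ColumnTransformer, so the partial dependence is computed with respect to the original integer-coded categorical column):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 4, size=300),   # integer-coded categorical column
    rng.normal(size=300),           # numeric column
])
y = (X[:, 0] == 2).astype(int)

# One-hot encode column 0 inside the pipeline, pass the numeric column through.
preprocess = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown="ignore"), [0])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, RandomForestClassifier(random_state=0)).fit(X, y)

# Average predicted probability for each value of the *original* categorical column.
print(partial_dependence(model, X, features=[0], kind="average"))
```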


@NicolasHug NicolasHug commented Feb 24, 2022

That would probably be a different thing. Our PDP support is only defined for regressors, not classifiers. The "partial dependence" as we support it is defined as the expectation of a continuous target.
EDIT: nvm, I'm wrong; it can rely on the decision_function.

Projects: Categorical (To do)