
NOCATS: Categorical splits for tree-based learners #4899

Closed
wants to merge 13 commits

Conversation

@jblackburne
Contributor

@jblackburne jblackburne commented Jun 25, 2015

NOCATS stands for "Near-Optimal Categorical Algorithm Technology System". (What can I say? My coworker came up with it.) It adds support for categorical features to tree-based learners (e.g., DecisionTreeRegressor or ExtraTreesClassifier).

This PR is very similar to #3346, but allows for more categories, particularly with extra-randomized trees (see below).

How it works

We've replaced the threshold attribute of each node (a float64) with a union datatype containing a float64 threshold field and a uint64 cat_split field. When splitting on non-categorical features, we use the threshold field and everything works as before.

But when a feature is marked as categorical, the cat_split field is used instead. In a decision tree, each of its 64 bits indicates which direction a certain category goes, which implies a hard maximum of 64 categories in any feature. That limit is acceptable, because finding the best way to split 64 categories during tree building is already very expensive, so the practical limit will certainly be lower than 64.
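For concreteness, here is a minimal Python sketch of the bit-field idea; the helper name and the convention that a set bit means "go left" are assumptions for illustration, not the PR's actual Cython code.

```python
# Hypothetical illustration of the bit-field mode: bit k of cat_split routes
# category k. The "set bit means left" convention is an assumption.
def goes_left(cat_split, category):
    """Return True if integer category `category` (0..63) is sent to the left child."""
    return bool((cat_split >> category) & 1)

cat_split = (1 << 1) | (1 << 3)   # categories 1 and 3 go left, the rest go right
print([goes_left(cat_split, c) for c in range(5)])   # [False, True, False, True, False]
```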

In an extra-randomized tree, however, the expensive process of finding the very best split is bypassed, so it would be nice to allow more categories. So for these trees we use cat_split in a completely different way: when building the tree we randomly choose a set of categories to go left, then store only the minimum information needed to regenerate that set during tree evaluation. The information that we store is a random seed (in the most significant 32 bits) and the number of draws to perform (in the next 31 bits) [Edit: We now flip a virtual coin for each category, so the number of draws is no longer necessary]. By recreating the split information as needed in each node rather than storing it explicitly, we are able to support large numbers of categories without causing the classifiers to balloon in size.

How does a tree know which interpretation to use? We encode that information in the least significant bit of cat_split. If the LSB is 0, we treat cat_split as a bit field; if it is 1, we treat it as a random seed and number of draws. We do not lose generality by forcing category 0 to always go right, since there is a left-right symmetry.
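Putting the two modes together, a rough Python sketch of this dispatch might look like the following. The bit layout (mode flag in the LSB, seed in the top 32 bits) follows the description above; the function name and the use of NumPy's RandomState are illustrative assumptions, not the branch's Cython implementation.

```python
import numpy as np

def split_goes_left(cat_split, category):
    if cat_split & 1 == 0:
        # LSB == 0: treat cat_split as a bit field; bit 0 (category 0) is always
        # 0, i.e. category 0 goes right, which is what keeps the flag unambiguous.
        return bool((cat_split >> category) & 1)
    # LSB == 1: the top 32 bits hold a random seed. Replay one virtual coin flip
    # per category and look up the flip for the category of interest.
    rng = np.random.RandomState(cat_split >> 32)
    return bool(rng.randint(0, 2, size=category + 1)[category])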

One last detail: to avoid regenerating the random split for every sample during tree evaluation, we allocate a temporary buffer for each node large enough to serve as a bit field. The buffers are freed when evaluation finishes.

How to use it

The fit method of the relevant learners has a new optional parameter categorical. You can give it an array of feature indices, a boolean array of length n_features, or the strings 'None' (the default) or 'All'. Categorical feature data will be rounded to the nearest integer, then those integers will serve as the category labels. (Internally they are mapped to range(n_categories)).
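A usage sketch, assuming this branch is installed (the categorical fit parameter below exists only in this PR, not in a released scikit-learn):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Column 0 holds integer category codes; column 1 is an ordinary numeric feature.
X = np.array([[0., 1.2],
              [1., 0.7],
              [2., 3.1],
              [1., 2.8]])
y = np.array([0, 0, 1, 1])

clf = ExtraTreesClassifier(n_estimators=10, random_state=0)
clf.fit(X, y, categorical=[0])   # mark feature 0 as categorical
# Equivalent: clf.fit(X, y, categorical=np.array([True, False]))
```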

Comments, caveats, etc.

  1. RandomSplitter generates a random categorical split by first generating a random seed, then generating a number of draws to make. To simulate flipping a coin for each category, the number of draws should come from a Binomial distribution, but currently we use a uniform distribution. I welcome comments on how desirable it would be to change this into a Binomial draw. [Edit: RandomSplitter now sends each category left or right using a simple coin flip. This is equivalent to the Binomial draw.]
  2. When building the tree, each node generates its split using the full set of categories for the feature in question, rather than only the subset of categories present among the node's samples. For the BestSplitters, this means it takes longer to find the split. For the RandomSplitter, it means there is a chance that all of the node's samples will be sent in the same direction. This contrasts with the non-categorical behavior, where a non-trivial split is guaranteed for non-constant features. The chance is generally small (and smaller with a Binomial draw than with a uniform draw). I made this choice because splitting based on the current subset of categories would introduce a lot of new complexity and storage requirements. One alternative would be to have the RandomSplitter generate random splits until a non-trivial split is achieved. [Edit: This is now implemented. Random splits are generated until a non-trivial split is found, or until a maximum of 20 tries (to limit the worst-case runtime); see the sketch after this list. This change renders this bullet point essentially moot aside from runtime considerations.] Comments on this are also welcome.
  3. Categorical features are not supported for sparse inputs. This is because I did most of this work before the support for sparse inputs was added, and I am not as familiar with that part of the code. Plus, it seems that sparse inputs become less necessary when you are not using one-hot encoding.
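Here is the standalone sketch of the retry idea promised in item 2; this is plain NumPy for illustration, not the branch's Cython, and the helper name is made up.

```python
import numpy as np

def random_nontrivial_split(node_categories, n_categories, rng, max_tries=20):
    """Flip a coin per category; re-draw if every sample at the node goes the same way."""
    for _ in range(max_tries):
        go_left = rng.randint(0, 2, size=n_categories).astype(bool)
        left_mask = go_left[node_categories]
        if left_mask.any() and not left_mask.all():
            return go_left                 # at least one sample on each side
    return go_left                         # give up after max_tries; may be trivial

rng = np.random.RandomState(0)
print(random_nontrivial_split(np.array([0, 0, 0, 1, 1]), n_categories=3, rng=rng))
```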
@glouppe
Member

@glouppe glouppe commented Jun 25, 2015

Awesome! I will be on vacation for the next two weeks, but I will definitely look into it at my return.

(Be patient, our review and integration process requires some time -- but don't hesitate to ping us if you see things stalling.)

@jblackburne
Contributor Author

@jblackburne jblackburne commented Jun 25, 2015

Ok, thanks.

Hm, it looks like there are two test errors. The first is easy; I need to use six.moves.zip instead of itertools.izip. The second is that older versions of NumPy apparently don't like union datatypes, at least the way I constructed them. It looks like I can fix it using this SO question.
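(For anyone curious, one way to emulate a C-style union in NumPy is a structured dtype whose fields share the same byte offset; a rough sketch follows. Whether this matches the fix actually used in the branch is my assumption.)

```python
import numpy as np

# Two 8-byte fields sharing offset 0, so they alias the same memory,
# much like a C union of double and uint64.
split_dtype = np.dtype({'names': ['threshold', 'cat_split'],
                        'formats': [np.float64, np.uint64],
                        'offsets': [0, 0],
                        'itemsize': 8})

node = np.zeros(1, dtype=split_dtype)
node['cat_split'] = (1 << 3) | (1 << 1)
print(node['threshold'])   # the same bytes reinterpreted as a float64
```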

@arjoly
Member

@arjoly arjoly commented Jun 26, 2015

Pinging myself. Looks awesome. I will free up time to review this PR.

@arjoly
Member

@arjoly arjoly commented Jun 26, 2015

It would be awesome if you added some tests.

@jblackburne
Contributor Author

@jblackburne jblackburne commented Oct 15, 2015

Fixed some bugs and addressed most of the caveats. Working on some unit tests. Code review welcome!

@amueller
Member

@amueller amueller commented Oct 15, 2015

There are a bunch of changes to the trees in master. Not sure how they relate to yours. Maybe try to rebase? Or check out the changes first?

@raghavrv
Member

@raghavrv raghavrv commented Nov 4, 2015

@jblackburne Could I help you with this PR? (That would involve me sending PRs to your NOCATS branch following reviews from our devs.) Also, you need to rebase it onto master first! :)

@jblackburne
Contributor Author

@jblackburne jblackburne commented Nov 4, 2015

Hi @rvraghav93. Sure, PRs would be welcome, especially unit tests. The rebase is done, and I'm waiting to push it until I've had a chance to test it a little. Give me a couple of days.

@raghavrv
Member

@raghavrv raghavrv commented Nov 4, 2015

Sure please take your time :)

@jblackburne jblackburne force-pushed the jblackburne:NOCATS branch from 847f442 to bfd6bb0 Nov 5, 2015
@jblackburne
Contributor Author

@jblackburne jblackburne commented Nov 5, 2015

Ok, the rebase is done. Anyone who has cloned this will need to re-clone it, since I altered history. Travis-CI fails when the NumPy version is < 1.7; this is a known problem. I don't know why the AppVeyor build was canceled.

@raghavrv
Member

@raghavrv raghavrv commented Nov 6, 2015

We can safely ignore appveyor for the time being... Thanks for the rebase! I'll clone your fork and send a PR to your branch soon!

@raghavrv
Member

@raghavrv raghavrv commented Nov 6, 2015

Also, I think you could squash to <= 3 commits! It will make it cleaner to trace back any regressions in the future! :)
Also, a minor tip (which you can choose to ignore): you could prefix the commit headers with tags like ENH / FIX / MAINT and put all of the squashed descriptions inside the commit bodies if you feel that is necessary...

@raghavrv
Member

@raghavrv raghavrv commented Nov 6, 2015

Never mind about the squash... We'll do it at the end... I've cloned your repo and started working on it... Will ping you when I'm done :)

@raghavrv raghavrv mentioned this pull request Nov 10, 2015
@raghavrv
Member

@raghavrv raghavrv commented Nov 15, 2015

Could you update your master and rebase this branch again, please? ;) (Since the C files are removed, you might have to check them out too.)

EDIT: I think the rebase should take care of that... but I am not sure, since you must have explicitly committed those C files previously...

@jblackburne jblackburne force-pushed the jblackburne:NOCATS branch from bfd6bb0 to 224949a Nov 16, 2015
@jblackburne
Contributor Author

@jblackburne jblackburne commented Nov 16, 2015

Here you go. Git didn't do it for me, but it was pretty easy anyway.

@raghavrv
Member

@raghavrv raghavrv commented Feb 17, 2016

Now that I am getting to know the tree code better, this PR looks amazing!

One comment: is not splitting based on the current subset of data the correct thing to do? Is that how R handles it?

Also, could you compare your implementation with a dataset having categorical features vs. the master branch (by simply encoding those categorical features), to check for accuracy variations?

Thanks for the PR...

@jblackburne
Contributor Author

@jblackburne jblackburne commented Feb 18, 2016

Not splitting on the current subset of data causes two problems.

The first is that it's not as fast (I have traded speed for algorithmic simplicity). This problem affects DecisionTree more than ExtraTree because the former must test every possible permutation of categories when fitting, and where factorials are concerned, smaller arguments are much better! But I'm hoping that it's not too bad compared to one-hot, for the values of n_categories that people will be using. This is not a problem for ExtraTree, and honestly I'm more excited about that one anyway, because it allows you to have really large n_categories.

The second problem only affects ExtraTree. There's a chance that the random permutation that is chosen will result in a trivial split (meaning that it will send all samples to one child) despite there being a variety of categories present for the chosen feature. For example, if the sample consisted of three "smoky" and two "effervescent" and zero "swirly", this would happen if the RandomSplitter randomly sent "swirly" right and the other two left (a 25% chance). Because it's not restricting itself to "smoky" and "effervescent", it doesn't know that it has selected a trivial split. This is the incorrect thing to do if you consider that the baseline (non-categorical) RandomSplitter will never make this mistake. You can see that it's more likely to happen with fewer categories represented in the current sample, so 25% is as bad as it gets. RandomSplitter currently works around this by re-rolling until it gets a nontrivial split, up to a maximum of 20 re-rolls. In the case above, this reduces the chances of a trivial split to 0.25**20, or about a part in a trillion.

TL;DR It's not incorrect (well, maybe once in a very great while). It makes DecisionTree slower than it could be for categorical features, but I think it's good enough for now.

I'm not sure how R's implementation works under the hood, unfortunately.

EDIT: Sorry, my math is wrong. It is a 50% chance in the example above, not 25%, because the trivial split can occur by sending both categories left OR right. So 20 iterations leads to a trivial split one time in a million, not one time in a trillion. Hm. I will push a new commit raising the maximum from 20 to 40, or maybe more.
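(For reference, a quick back-of-the-envelope check of those numbers, using the two-represented-categories worst case described above:)

```python
p_trivial = 0.5         # worst case: both represented categories sent the same way
print(p_trivial ** 20)  # ~9.5e-07 -- about one in a million after 20 tries
print(p_trivial ** 40)  # ~9.1e-13 -- about one in a trillion after 40 tries
```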

@jblackburne
Contributor Author

@jblackburne jblackburne commented Feb 18, 2016

I have done some comparisons of NOCATS to one-hot encoding using a toy dataset, and convinced myself that things were working. I'll try and put together a more in-depth study with larger train/test datasets. Stay tuned.

@raghavrv
Member

@raghavrv raghavrv commented Mar 1, 2016

@jblackburne Thanks a lot for the patient response!

@raghavrv
Member

@raghavrv raghavrv commented Mar 14, 2016

OK, so the important question from the API point of view is: are we okay with the categorical parameter in fit?

@amueller @jnothman @GaelVaroquaux @glouppe @agramfort Views on the same?

@GaelVaroquaux
Member

@GaelVaroquaux GaelVaroquaux commented Mar 14, 2016

IMHO it should be a class parameter. As usual, the question is: how do you do cross-validation with categorical variables?

@raghavrv
Member

@raghavrv raghavrv commented Mar 14, 2016

> a class parameter

categorical becomes data-dependent... I'm not sure if we want it as a class param??

> how do you do cross-val with categorical variables.

If I am not missing something, we can pass the categorical parameter inside the fit_params dict, correct?

@GaelVaroquaux
Member

@GaelVaroquaux GaelVaroquaux commented Mar 14, 2016

> categorical becomes data-dependent... I'm not sure if we want it as a class param??

Yes, but only in the feature direction.

> how do you do cross-val with categorical variables.

> If I am not missing something, we can pass the categorical parameter inside the fit_params dict, correct?

Yes, but then it becomes very cumbersome to use in a larger setting.

@lesshaste

@lesshaste lesshaste commented Mar 18, 2016

Would it make sense to run the new code on the benchmarks from https://github.com/szilard/benchm-ml? @GaelVaroquaux mentioned on the mailing list, in relation to these benchmarks specifically, that "In tree-based Not handling categorical variables as such hurts us a lot"

@jblackburne
Contributor Author

@jblackburne jblackburne commented Mar 18, 2016

@lesshaste: It looks like they are using decision tree-based classifiers (i.e., RandomForestClassifier and GradientBoostingClassifier) rather than extra-random tree-based classifiers. And it looks like their dataset's categorical features (airlines, origin & destination airports) probably have cardinality > 64. These two factors together mean NOCATS can't be used.

@raghavrv
Member

@raghavrv raghavrv commented Mar 18, 2016

@jblackburne would you be willing to give me push access to this branch? It would make it easier for me to collaborate. I'll make sure I don't force push.

And now the todo for this PR

  • Move categorical from a fit parameter to a class parameter.
  • Make categorical splitting node-based (i.e., split on the categories present at each node).
  • Benchmark against master (one-hot encoding) - thanks @jblackburne for doing this!

(PS: I'm currently at an OpenML workshop. A lot of people here seem to want this feature!)

@jph00

@jph00 jph00 commented Aug 2, 2017

Sorry, one question: what's the view of the core team about this general approach? I had assumed that something much simpler would be done, namely to do exactly the same thing as one-hot encoding, but in the faster and lower-memory way that you can when you have categorical variables (i.e., just allow a single 1-vs-rest split at each leaf). I haven't seen any upside in practice to supporting more complex splits where you pick multiple levels to split on, since in practice the tree can always handle that case with multiple 1-vs-rest splits in the tree.

So what I'm trying to ask is: which approach do you guys feel is most interesting:

  1. Fast, low-memory, 1-vs-rest splits (i.e. supports same functionality as one-hot encoding)
  2. More complex multi-level splits like in this PR
  3. Or neither - just let users do integer or 1-hot coding themselves.
@jimmywan
Contributor

@jimmywan jimmywan commented Sep 11, 2017

I haven't seen any upside in practice of supporting more complex splits where you pick multiple levels to split on - since in practice the tree can always handle that case with multiple 1-vs-rest splits in the tree.

Others can probably explain this better than I can, but the general idea here is that in the presence of a categorical feature with multiple values, the optimal way to split the tree may be to partition multiple values at the same time.

If you're using an integer encoding (aka LabelEncoder), your encoding may not be in the optimal ordering and it may not be possible to generate it in the optimal ordering for all cases.

If you use one-hot encoding, the entropy reduction for partitioning that single value might not be beneficial enough for the algorithm to choose that route.

A different way to say this is that currently supported approaches could theoretically reach the same conclusions, but it's very easy to concoct scenarios where it's highly unlikely to do so.

Example: let's say you had 20 different values for a particular categorical feature that have been integer-encoded. In any particular part of the tree, the optimal split might be any one of the following (a toy illustration follows this list):

  • "odd vs even"
  • "split by the midpoint"
  • "numbers divisible by 7"
  • etc.
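Here is the toy illustration mentioned above (my own sketch, not from this PR): a target that depends on the parity of an integer-encoded category, which no single threshold split on the raw codes can capture, but which a single set-based split would separate perfectly.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
codes = rng.randint(0, 20, size=1000)   # 20 integer-encoded categories
y = codes % 2                           # class label = "odd vs even"

# A depth-1 tree restricted to threshold splits on the raw codes scores near chance.
stump = DecisionTreeClassifier(max_depth=1).fit(codes.reshape(-1, 1), y)
print(stump.score(codes.reshape(-1, 1), y))   # roughly 0.5

# A single categorical split sending {odd codes} one way and {even codes} the
# other would classify this perfectly.
```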
@julioasotodv

@julioasotodv julioasotodv commented Sep 24, 2017

I just wanted to complete the list that @raghavrv started:

Listing down the Cat. Variable handling methods of other packages:

  • XGBoost - dmlc/xgboost#95 (comment) - One hot encoding or Level encoding (No categorical splitting)
  • randomForest - http://stats.stackexchange.com/a/96442/58790 - The same way as this PR (sends some labels left and others right)
  • rpart - Not clear
  • gbm - Found no info
  • weka - Does not (needs one hot encoding)
  • H2O - http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm-faq/histograms_and_binning.html - Using bitsets, seems to be very efficient and accurate
  • Spark ML - Naturally handles categorical features, but only up to the maxBins hyperparameter, given that all features are histogram binned (I still have to browse through the source code)

@scikit-learn scikit-learn deleted a comment from codecov bot Oct 20, 2017
@h-vetinari

@h-vetinari h-vetinari commented Nov 15, 2017

Any news on the current status of this? I needed (wanted?) this feature so much I'm currently working on a local copy of this pull request, haha.

@jblackburne
Contributor Author

@jblackburne jblackburne commented Nov 17, 2017

@h-vetinari Only a few things remain to be done on this. It needs to be brought up to the latest changes in master, and more unit tests need to be written, as codecov has so helpfully pointed out. :) I could probably make time to do this.

And then of course it needs to be reviewed. This is challenging, since it is a fairly substantial change to a fairly hairy section of the code. See @amueller's comment above.

@julioasotodv

@julioasotodv julioasotodv commented Nov 17, 2017

Given that I believe this is one of the most requested features in sklearn (along with surrogate splits for natural null handling in trees), there should be quite a few people willing to test and benchmark this with different datasets (myself included) :)

@js3711

@js3711 js3711 commented Jan 9, 2018

I am interested in seeing this feature as well. For those that are interested, how can we help push this over the finish line? Exactly what work is left (other than rebasing)?

@jnothman
Member

@jnothman jnothman commented Jan 9, 2018

It needs a code review:

  • Check that tests are understandable and adequate to test the new functionality
  • Check that the implementation does not present substantial risks to existing functionality (including memory leaks)
  • Check that the implementation is readable / maintainable and there are no obvious ways to improve that
  • Check that the API is well designed
sjonany pushed a commit to sjonany/Kaggle-Titanic that referenced this pull request Jan 14, 2018
Doesn't look like svms or even random forest in sklearn handle categorical features: scikit-learn/scikit-learn#4899. They just get converted to enums.

The SVM score improved, but random forest went down a bit. But that's probably because we now have more features for random forest, and will need to do hyperparam tuning later.

Before:
Random forest 0.822780047668
SVM 0.76217937805

After:
Random forest 0.810420497106
SVM 0.795838156849
@amueller
Member

@amueller amueller commented Mar 6, 2018

Not sure it's a good idea to add more features on top of an already big PR, so maybe that's for a follow-up, but I think it would be good to add a multi-class heuristic for efficient splits. I've read of people doing one-vs-rest with the binary algorithm.

@dipanjanS

@dipanjanS dipanjanS commented Jul 22, 2018

Any update on the status of when this might be coming in?

azrdev added a commit to azrdev/sklearn-seco that referenced this pull request Sep 1, 2018
@adrinjalali
Member

@adrinjalali adrinjalali commented Oct 8, 2018

Hi @jblackburne, @raghavrv,

Took me a while to go through this thread and the code. A lot has changed in the two years since what I guess was the last commit on this branch.

Do you think you've got time to rebase/merge master so we can take it from there?

@jnothman
Member

@jnothman jnothman commented Oct 8, 2018

@adrinjalali
Member

@adrinjalali adrinjalali commented Oct 11, 2018

(I'm really sorry about that, and that I didn't realize).

Alternatively, I can base a new PR on this one and try to address the list I gathered reading through this thread. @jblackburne what would you prefer?

@ogrisel
Member

@ogrisel ogrisel commented Oct 17, 2019

Closing in favor of #4899.

@ogrisel ogrisel closed this Oct 17, 2019
@adrinjalali
Member

@adrinjalali adrinjalali commented Oct 17, 2019

You mean in favor of #12866 probably :)
