NOCATS: Categorical splits for tree-based learners #4899
Conversation
Awesome! I will be on vacation for the next two weeks, but I will definitely look into it on my return. (Be patient, our review and integration process requires some time -- but don't hesitate to ping us if you see things stalling.)
Ok, thanks. Hm, it looks like there are two test errors. The first is easy; I need to use
Ping myself. Looks awesome. I will unlock time to review this PR.
It would be awesome if you could add some tests.
Fixed some bugs and addressed most of the caveats. Working on some unit tests. Code review welcome!
There are a bunch of changes in master in the trees. Not sure how they relate to yours. Maybe try to rebase? Or check out the changes first?
@jblackburne Could I help you in this PR? (That would involve me sending PRs to your branch "NOCATS" following reviews from our devs.) Also, you need to rebase it upon master first! :)
Hi @rvraghav93 Sure, PRs would be welcome, especially unit tests. The rebase is done, and I'm waiting to push it until I've had a chance to test it a little. Give me a couple days.
Sure please take your time :)
Ok, rebase is done. Anyone who has cloned this will need to re-clone it since I altered history. Travis-CI fails when numpy version < 1.7; this is a known problem. Don't know why the appveyor build was canceled.
We can safely ignore appveyor for the time being... Thanks for the rebase! I'll clone your fork and send a PR to your branch soon!
Also I think you could squash to <= 3 commits! It will be cleaner to trace back any regressions in the future! :)
Never mind about the squash... We'll do it at the end... I've cloned your repo and started working on it... Will ping you when I'm done :)
Could you update your master and rebase this branch again please? ;) (since c files are removed, you might have to check them out too) EDIT: I think rebase should do that... but I am not sure as you must have explicitly committed those c files previously...
Here you go. Git didn't do it for me, but it was pretty easy anyway.
Now that I am getting to know the tree code better, this PR looks amazing! One comment: is not splitting based on the current subset of data the correct thing to do? Is it how other implementations handle it? Also, could you compare your implementation on a dataset with categorical features against a one-hot-encoded version of the same data? Thanks for the PR...
Not splitting on the current subset of data causes two problems. The first is that it's not as fast (I have traded speed for algorithmic simplicity); this mainly affects the `BestSplitter`s. The second problem only affects the `RandomSplitter`: there is a chance that the random split sends every category present in the current subset the same way, producing a trivial split. TL;DR: It's not incorrect (well, maybe once in a very great while), and I'm not sure how other implementations handle it. EDIT: Sorry, my math is wrong. It is a 50% chance in the example above (a node where only two categories are present), not 25%, because the trivial split can occur by sending both categories left OR right. So 20 iterations leads to a trivial split one time in a million, not one time in a trillion. Hm. I will push a new commit raising the maximum from 20 to 40, or maybe more.
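To spell out the arithmetic behind that edit (a quick sketch of my own, not from the original thread): with two categories present and one fair coin flip per category, a single random split is trivial with probability 1/2, so the chance that 20 independent tries are all trivial is (1/2)^20, roughly one in a million.

```python
# Chance that one random categorical split is trivial when only two categories
# are present and each category is sent left/right by a fair coin flip:
# both go left or both go right, i.e. 2 * (1/2)^2 = 1/2.
p_trivial_single = 2 * 0.5 ** 2

# Chance that every one of 20 (or 40) independent tries is trivial.
p_trivial_20_tries = p_trivial_single ** 20   # ~9.5e-7, "one time in a million"
p_trivial_40_tries = p_trivial_single ** 40   # ~9.1e-13 if the cap is raised to 40

print(p_trivial_single, p_trivial_20_tries, p_trivial_40_tries)
```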
I have done some comparisons of NOCATS to one-hot encoding using a toy dataset, and convinced myself that things were working. I'll try and put together a more in-depth study with larger train/test datasets. Stay tuned.
@jblackburne Thanks a lot for the patient response!
Ok, so the important question from the API point of view is whether we are okay with the `categorical` parameter being passed to `fit`. @amueller @jnothman @GaelVaroquaux @glouppe @agramfort Views on the same?
IMHO it should be a class parameter; as usual the question is: how do you do cross-val with categorical variables?
If I am not missing something, we can pass the `categorical` argument to `fit` during cross-validation.
Yes, but only in the feature direction.
Yes, but then it becomes very cumbersome to use in a larger setting.
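For what it's worth, here is a minimal sketch (my illustration, not from the thread) of how a fit-time argument could be routed through cross-validation via `fit_params`. The `categorical` keyword is the parameter proposed by this PR and is not part of released scikit-learn.

```python
# Hypothetical usage: "categorical" is the fit parameter proposed in this PR,
# not part of released scikit-learn; fit_params forwards it to every CV fit.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = np.column_stack([
    rng.randint(0, 5, size=200),   # feature 0: 5 integer-coded categories
    rng.randn(200),                # feature 1: ordinary numeric feature
])
y = (X[:, 0] % 2 == 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=5,
                         fit_params={"categorical": [0]})  # proposed keyword
```

As noted in the reply above, this only works in the feature direction, and every wrapper around the estimator has to forward the argument, which is what makes a constructor parameter attractive.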
Would it make sense to run the new code on the benchmarks from https://github.com/szilard/benchm-ml ? @GaelVaroquaux mentioned on the mailing list, in relation to these benchmarks specifically, that "in tree-based [models], not handling categorical variables as such hurts us a lot".
@lesshaste: It looks like they are using decision tree-based classifiers.
@jblackburne would you be willing to give me push access to this branch? It would make it easier for me to collaborate. I'll make sure I don't force push. And now the todo for this PR:
(PS: I'm currently in an OpenML workshop. A lot of people here seem to want this feature!)
Sorry, one question: what's the view of the core team about this general approach? I had assumed that something much simpler would be done, which is to do exactly the same thing as 1-hot encoding, but in the faster and lower-memory way that you can do if you have categorical variables (i.e. just allow a single 1-vs-rest split at each leaf). I haven't seen any upside in practice of supporting more complex splits where you pick multiple levels to split on, since in practice the tree can always handle that case with multiple 1-vs-rest splits in the tree. So what I'm trying to ask is: which approach do you guys feel is most interesting:
Others can probably explain this better than I can, but the general idea here is that in the presence of a categorical feature with multiple values, the optimal way to split the tree may be to partition multiple values at the same time. If you're using an integer encoding (aka LabelEncoder), your encoding may not be in the optimal ordering, and it may not be possible to generate it in the optimal ordering for all cases. If you use one-hot encoding, the entropy reduction for partitioning that single value might not be beneficial enough for the algorithm to choose that route. A different way to say this is that currently supported approaches could theoretically reach the same conclusions, but it's very easy to concoct scenarios where it's highly unlikely to do so. Example: let's say you had 20 different values for a particular categorical feature that have been integer encoded. In any particular part of the tree, the optimal split might be any one of many multi-category partitions of those 20 values (see the sketch below).
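To make that concrete, here is a small toy illustration of my own (hypothetical data, not from the thread): when the target depends on an arbitrary subset of category values, no single threshold on the integer codes can separate it, while a categorical (set-membership) split can do it in one node.

```python
# Toy illustration with hypothetical data: the target depends on a subset of
# category values, so no single threshold on the integer codes separates it,
# but a set-membership (categorical) split does.
import numpy as np

rng = np.random.RandomState(0)
codes = rng.randint(0, 20, size=1000)      # integer-encoded category, 20 levels
good = {2, 5, 11, 17}                      # the levels that drive the target
y = np.isin(codes, list(good)).astype(int)

# A single axis-aligned threshold "codes <= t" never isolates {2, 5, 11, 17}.
for t in range(20):
    assert set(codes[codes <= t]) != good

# A categorical split is set membership, which separates y perfectly.
go_left = np.isin(codes, list(good))
assert (y[go_left] == 1).all() and (y[~go_left] == 0).all()
```

One-hot encoding can eventually recover the same partition, but only through a chain of 1-vs-rest splits, each of which has to look worthwhile on its own.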
I just wanted to complete the list that @raghavrv started, listing the categorical-variable handling methods of other packages: XGBoost - dmlc/xgboost#95 (comment) - one-hot encoding or level encoding (no categorical splitting).
Any news on the current status of this? I needed (wanted?) this feature so much I'm currently working on a local copy of this pull request, haha.
@h-vetinari Only a few things remain to be done on this. It needs to be brought up to the latest changes in master, and more unit tests need to be written, as codecov has so helpfully pointed out. :) I could probably make time to do this. And then of course it needs to be reviewed. This is challenging, since it is a fairly substantial change to a fairly hairy section of the code. See @amueller's comment above.
Given that I believe this is one of the most requested features in sklearn (alongside surrogate splits for natural null handling in trees), there should be quite a few people willing to test and benchmark this with different datasets (myself included) :)
I am interested in seeing this feature as well. For those that are interested, how can we help push this over the finish line? Exactly what work is left (other than rebasing)? |
It needs a code review:
|
Doesn't look like SVMs or even random forest in sklearn handle categorical features: scikit-learn/scikit-learn#4899. They just get converted to enums. The SVM score improved, but random forest went down a bit. But that's probably because we now have more features for random forest, and we will need to do hyperparameter tuning later. Before: random forest 0.822780047668, SVM 0.76217937805. After: random forest 0.810420497106, SVM 0.795838156849.
Not sure it's a good idea to add more features on top of an already big PR, so maybe that's for a follow-up, but I think it would be good to add a multi-class heuristic for efficient splits. I've read of people doing one-vs-rest with the binary algorithm.
Any update on the status of when this might be coming in? |
Hi @jblackburne, @raghavrv, it took me a while to go through this thread and the code. A lot has changed since two years ago, which I guess is when this branch saw its last commit. Do you think you've got time to rebase/merge master, and we take it from there?
(@raghavrv is sadly no longer with us.)
|
(I'm really sorry about that, and that I didn't realize). Alternatively, I can base a new PR on this one and try to address the list I gathered reading through this thread. @jblackburne what would you prefer? |
Closing in favor of #4899. |
You mean in favor of #12866 probably :) |
NOCATS stands for "Near-Optimal Categorical Algorithm Technology System". (What can I say? My coworker came up with it.) It adds support for categorical features to tree-based learners (e.g., `DecisionTreeRegressor` or `ExtraTreesClassifier`). This PR is very similar to #3346, but allows for more categories, particularly with extra-randomized trees (see below).
How it works
We've replaced the threshold attribute of each node (a `float64`) with a union datatype containing a `float64 threshold` field and a `uint64 cat_split` field. When splitting on non-categorical features, we use the `threshold` field and everything works as before.

But when a feature is marked as categorical, the `cat_split` field is used instead. In a decision tree, each of its 64 bits indicates which direction a certain category goes; this implies a hard maximum of 64 categories in any feature. Which is fine, because finding the best way to split 64 categories during the tree-building step is very expensive, and the practical limit will certainly be less than 64.

In an extra-randomized tree, however, the expensive process of finding the very best split is bypassed, so it would be nice to allow more categories. So for these trees we use `cat_split` in a completely different way: when building the tree we randomly choose a set of categories to go left, then store only the minimum information needed to regenerate that set during tree evaluation. The information that we store is a random seed (in the most significant 32 bits) ~~and the number of draws to perform (in the next 31 bits)~~ [Edit: We now flip a virtual coin for each category, so the number of draws is no longer necessary]. By recreating the split information as needed in each node rather than storing it explicitly, we are able to support large numbers of categories without causing the classifiers to balloon in size.

How does a tree know which type it is? We encode that information in the least significant bit of `cat_split`. If the LSB is 0, we treat it as a flag field; if it is 1, we treat it as a random seed and number of draws. We do not lose generality by forcing category 0 to always go right, since there is a left-right symmetry.

One last detail: to avoid regenerating the random split for every sample during tree evaluation, we allocate a temporary buffer for each node large enough to serve as a bit field. The buffers are freed when evaluation finishes.
How to use it
The `fit` method of the relevant learners has a new optional parameter `categorical`. You can give it an array of feature indices, a boolean array of length `n_features`, or the strings `'None'` (the default) or `'All'`. Categorical feature data will be rounded to the nearest integer, then those integers will serve as the category labels. (Internally they are mapped to `range(n_categories)`.)
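For example, usage might look like the sketch below (my illustration; it assumes this branch is installed, since `categorical` is not part of released scikit-learn):

```python
# Hypothetical usage of the proposed API (requires this branch, not released
# scikit-learn). Feature 0 is categorical with integer-coded levels.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
X = np.column_stack([
    rng.randint(0, 30, size=500),   # categorical feature with 30 levels
    rng.randn(500),                 # ordinary numeric feature
])
y = (X[:, 0] % 3 == 0).astype(int)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X, y, categorical=[0])      # mark feature 0 as categorical
# Per the description, a boolean mask of length n_features or the strings
# 'None' / 'All' would work as well.
```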
Comments, caveats, etc.

- ~~`RandomSplitter` generates a random categorical split by first generating a random seed, then generating a number of draws to make. To simulate flipping a coin for each category, the number of draws should come from a Binomial distribution, but currently we use a uniform distribution. I welcome comments on how desirable it would be to change this into a Binomial draw.~~ [Edit: `RandomSplitter` now sends each category left or right using a simple coin flip. This is equivalent to the Binomial draw.]
- Splits are generated over the full set of categories, not just the categories present in the node's current subset of data. For the `BestSplitter`s, this means it will take longer to find the split. For the `RandomSplitter`, it means there is a chance that the current subset will all be sent in the same direction. This contrasts with the non-categorical behavior, where a non-trivial split is guaranteed for non-constant features. The chance is generally small (and it's smaller if we use a Binomial draw rather than a uniform draw). I made this choice because it would introduce a lot of new complexity and storage requirements to split based on the current subset of categories. ~~One alternative would be to have the `RandomSplitter` generate random splits until a non-trivial split is achieved.~~ [Edit: This is now implemented. Random splits are generated until a non-trivial split is found, or until a maximum of 20 tries (to limit the worst case runtime). This change renders this whole bullet point essentially meaningless aside from runtime speed considerations.] Comments on this are also welcome.
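A toy model of the retry behavior mentioned in that edit (my own sketch; the actual splitter works on Cython data structures and its RNG differs):

```python
# Illustrative model of "draw random category splits until one is non-trivial,
# giving up after a bounded number of tries"; not the actual implementation.
import numpy as np

def random_category_split(present_categories, rng, max_tries=20):
    """Flip one coin per category; retry while every category present in the
    current node lands on the same side, up to max_tries attempts."""
    for _ in range(max_tries):
        go_left = rng.randint(0, 2, size=64).astype(bool)
        sides = go_left[present_categories]
        if sides.any() and not sides.all():   # non-trivial: both branches populated
            return go_left
    return go_left                            # rare: still trivial after max_tries

rng = np.random.RandomState(42)
split_mask = random_category_split(np.array([3, 7]), rng)
```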