NOCATS: Categorical splits for tree-based learners (ctnd.) #12866
Conversation
…causing all kinds of problems. Now safe_realloc requires the item size to be explicitly provided. Also, it can allocate arrays of pointers to any type by casting to void*.
…to categorical variables. Replaced the threshold attribute of SplitRecord and Node with SplitValue.
…hat defaults to -1 for each feature (indicating non-categorical).
…ediction with trees. Also introduced category caches for quick evaluation of categorical splits.
…he best categorical split.
Wow. Good on you for taking this on!
~~I assume the appveyor failure is unrelated to this PR I suppose.~~
Also I'm super late to the party, but what is the benefit of NOCATS over One-Hot-Encoding the categories?
One-hot encoding only allows you to split off 1-vs-the-rest, whereas the optimal split for a categorical variable may be many-vs-many. For example, the optimal split at a given node may be:
but one-hot encoding would only be able to yield one of
This obviously affects the depth / number of splits that are necessary to get a similarly good result.
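To make the combinatorics concrete, here is a small, hypothetical Python sketch (my illustration, not code from the PR) comparing the partitions reachable by a native categorical split with those reachable after one-hot encoding; the category names are made up.

```python
from itertools import combinations

categories = ["a", "b", "c", "d"]

# A native categorical split can send any non-empty proper subset to the left
# child: 2**(k - 1) - 1 distinct many-vs-many partitions for k categories.
categorical_splits = [
    (set(left), set(categories) - set(left))
    for size in range(1, len(categories))
    for left in combinations(categories, size)
    if "a" in left  # fix one category's side so each partition is counted once
]

# One-hot encoding only exposes indicator columns, so every split is one-vs-rest.
one_hot_splits = [({c}, set(categories) - {c}) for c in categories]

print(len(categorical_splits))  # 7, including many-vs-many such as {'a', 'c'} vs {'b', 'd'}
print(len(one_hot_splits))      # 4, all one-vs-rest
```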
@NicolasHug this is only one benchmark, but at least on this dataset, there are benefits to using NOCATS: #12866 (comment)
That is the exact solution for regression and some binary cases. Though I imagine you're doing this based on the unnormalized probabilities, which is one more level of indirection compared with the trees.
Actually, because you're doing a regression tree each time, the sorting may always be exact, depending on the loss. I need to think about that again and look at the formula.
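For reference, the "exact solution" under discussion is the classic Fisher/Breiman result: for regression (and binary classification with suitable losses), sorting the categories by their mean target value reduces the search to contiguous splits of that ordering. A minimal sketch with made-up data (my illustration, not code from the PR):

```python
import numpy as np

# Toy data: a single categorical feature encoded as integer codes 0..3.
x = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y = np.array([5.0, 6.0, 1.0, 2.0, 9.0, 8.0, 4.0, 3.0])

categories = np.unique(x)
means = np.array([y[x == c].mean() for c in categories])
order = categories[np.argsort(means)]  # categories sorted by mean target

# Only the k - 1 contiguous splits of `order` need to be evaluated,
# instead of all 2**(k - 1) - 1 subset splits.
candidates = [
    (set(order[:i].tolist()), set(order[i:].tolist()))
    for i in range(1, len(order))
]
print(order)       # [1 3 0 2]
print(candidates)  # ({1}, {0, 2, 3}), ({1, 3}, {0, 2}), ({0, 1, 3}, {2})
```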
@adrinjalali
We probably are going to implement this (or a version of it) for the new HistGradientBoosting ones instead. Realistically, we should have it in 6 months or so.
Great to hear, thanks!
@adrinjalali Any idea of an ETA on this? Just planning a few projects and this feature would be great to have!
we're planning to have a version of this, probably for |
As someone who regularly comes back to this PR to check the status of NOCATS, I looked around at what's currently happening - maybe others who have "only" subscribed to this thread are interested too. Following the "Categorical" project that this PR is a part of, I found #15550, which fits well with the quote above:
This was being worked on in #16909, but - unsurprisingly, considering the current state of the world - it was delayed and its inclusion in 0.24 is at risk. The last proposal from @NicolasHug was to continue with a pared-down (non-C++) version for 0.24, which is currently being worked on in #18394. In any case, very happy to see this moving forward!
+1 to see this in a release. CatBoost and other tree-based libraries have good categorical handling, but I would like to see sklearn handle this out of the box.
For everyone interested: version 0.24 was just released with categorical support in the HistGradientBoosting estimators.
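For anyone landing here, a minimal usage sketch of that native categorical support in the histogram-based gradient boosting estimators; the data and column layout are made up, and on 0.24 the estimators were still behind the experimental import (noted in a comment below).

```python
import numpy as np
# On scikit-learn 0.24 you also need:
# from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.RandomState(0)
X = np.column_stack([
    rng.randint(0, 5, size=200),  # column 0: categorical feature, codes 0..4
    rng.rand(200),                # column 1: numeric feature
])
y = (X[:, 0] >= 3).astype(int) ^ (X[:, 1] > 0.5)

# Mark column 0 as categorical; codes must be non-negative integers < max_bins.
clf = HistGradientBoostingClassifier(categorical_features=[0], random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```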
Hello, |
Another drawback of one-hot encoding is when the categorical feature to be encoded has a lot of possible values. This results in a large set of one-hot features, so if a tree randomly picks a subset of the features for splitting, these one-hot-encoded features are more likely to be picked than the original features.
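A rough back-of-the-envelope illustration of that dilution effect (the numbers are hypothetical, not from any benchmark):

```python
# Suppose 10 numeric features plus one categorical feature with 100 levels.
n_numeric = 10
n_levels = 100

n_columns_after_ohe = n_numeric + n_levels
# Probability that a uniformly chosen column belongs to the encoded feature:
p_encoded = n_levels / n_columns_after_ohe
print(round(p_encoded, 2))  # 0.91: random feature subsets are dominated by the
                            # one-hot columns of the single categorical feature
```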
Hi, I'm wondering whether random forests support categorical data yet?
Hello, I would like to inquire about the status of this branch. My team would really benefit from this and would no longer need to resort to |
@AndreaTrucchia have you checked the HistGradientBoosting estimators? |
@adrinjalali I am checking it; too bad most of my work concerns Random Forest. However, I think I can give it a try for |
Out of curiosity, do the preprocessing techniques we have to handle categorical variables not satisfy your needs in a |
Dear @adrinjalali, while in a scikit-learn environment I tend to one-hot-encode the categorical variables, with very good performance (see e.g. https://www.mdpi.com/2571-6255/5/1/30). However, in the R style of treating categorical variables (randomForest package), I can use the partialPlot function, which can rank the variables from, say, 1 ("this category enhances the classification of label A") to -1 ("this category strongly disagrees with the classification of label A").
Isn't that like our partial dependence? You could pass a pipeline with the OneHotEncoder to it.
That would probably be a different thing. Our PDP support is only defined for regressors, not classifiers. The "partial dependence" as we support it is defined as the expectation of a continuous target. |
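For completeness, here is a hedged sketch of the pipeline route mentioned above, using a regressor as discussed; the data, column layout, and model choice are made up for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
X = np.column_stack([
    rng.randint(0, 3, size=80),  # column 0: categorical feature, codes 0..2
    rng.rand(80),                # column 1: numeric feature
])
y = X[:, 0] * 2.0 + X[:, 1] + rng.rand(80) * 0.1

# One-hot encode the categorical column inside the pipeline, so partial
# dependence can still be requested on the raw (pre-encoding) column.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), [0])],
    remainder="passthrough",
)
model = make_pipeline(pre, RandomForestRegressor(random_state=0)).fit(X, y)

result = partial_dependence(model, X, features=[0])
print(result["average"])  # averaged predictions for each category code
```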
This PR continues the work of #4899. For now I've merged master into the PR, made it compile, and got the tests to run. There are several issues which need to be fixed; the list will be updated as I encounter them. Also, not all of these items are necessarily open: I have only collected them from the comments on the original PR, and need to make sure they're either already addressed or address them.
- almost_equal
- tree/tests (done)
- ensemble/tests (done)

Closes #4899
Future Work: These are the future-work items we already know of (i.e., outside the scope of this PR):
- [0, max(feature)]
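If the bracketed fragment above refers, as I read it, to the assumption that category codes lie in [0, max(feature)], one standard way to keep that range tight is ordinal encoding (my illustration, not part of the PR):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Arbitrary string categories become contiguous codes 0..n_categories-1,
# so max(feature) stays equal to n_categories - 1.
X = np.array([["red"], ["green"], ["blue"], ["green"]])
codes = OrdinalEncoder().fit_transform(X)
print(codes.ravel())  # [2. 1. 0. 1.]
```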
P.S. I moved away from the "task list" format due to its extremely buggy interface when combined with editing the post, which I'm doing extensively to make it easy for us to keep up with the status.