preprocessing

Hello everyone,

First of all, I want to take a moment to thank all contributors and everyone who supported this project in any way ;) You are awesome!

If you like the project and are interested in contributing to or maintaining it, you can contact me here or send me a private message:

Email: nidhalbacc@gmail.com

PS: You need to be familiar with Python and machine learning.

The current version of bucketize uses fixed boundaries. If users don't know these boundaries, they need to calculate them using cudf.

We should support splitting continuous variables into buckets based on quantile and uniform splits of the data.

For uniform splits, the statistics-gathering phase needs to compute the min and max of the column and derive the boundaries that create N buckets.
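As a rough illustration of the two split strategies described above, here is a minimal pure-Python sketch. It is not the library's implementation and does not use cudf; the function names `uniform_boundaries`, `quantile_boundaries`, and `assign_buckets` are hypothetical, and the choice of the "inclusive" quantile method is an assumption.

```python
from bisect import bisect_right
from statistics import quantiles


def uniform_boundaries(values, n_buckets):
    """Return the n_buckets - 1 interior boundaries that split [min, max] into equal-width buckets."""
    lo, hi = min(values), max(values)  # the statistics-gathering step
    width = (hi - lo) / n_buckets
    return [lo + i * width for i in range(1, n_buckets)]


def quantile_boundaries(values, n_buckets):
    """Return the n_buckets - 1 cut points that put roughly equal counts in each bucket."""
    # statistics.quantiles sorts the data itself and returns n - 1 cut points.
    return quantiles(values, n=n_buckets, method="inclusive")


def assign_buckets(values, boundaries):
    """Map each value to a bucket index in the range 0 .. len(boundaries)."""
    return [bisect_right(boundaries, v) for v in values]


# Hypothetical skewed column: one large outlier.
skewed = [1, 2, 3, 4, 100]
uniform = uniform_boundaries(skewed, 2)
by_quantile = quantile_boundaries(skewed, 2)
```

With the skewed column above, the uniform boundary sits at the midpoint of the range, so all values but the outlier share one bucket, while the quantile boundary sits at the median and splits the values more evenly. That difference is the usual reason to offer both options.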

Everyone is welcome to add more algorithms to this project. This repo is new, so we need contributions from everyone.

Write unit test coverage for SafeDataset and SafeDataLoader, along with the functions in utils.py.

BaseColumn::genericUnaryUDF
BaseColumn::genericBinaryUDF
BaseColumn::genericTrinaryUDF

https://github.com/facebookresearch/torcharrow/blob/main/csrc/velox/column.h#L364-L377

This is in the Eager Mode/Velox Backend.

The generic UDF call methods should be general enough to not be bound to any columns. For example, when there are no arguments or all arguments are scalars, conceptual

I recently ran the build with the style check enabled and found a significant number of warnings.
We need to do the following:

Enable style check for every build
Fix the current warnings

Is your feature request related to a problem? Please describe.
Change split values from all caps to lower case.
This makes file/directory naming more consistent with the split.

Describe the solution you'd like
TRAIN -> train
VALIDATION -> validation
TEST -> test
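The mapping above could be sketched as a small lookup. This is only an illustration: `SPLIT_NAMES` and `normalize_split` are hypothetical names, not part of the project's API.

```python
# Hypothetical mapping from the current all-caps split values to lower case.
SPLIT_NAMES = {"TRAIN": "train", "VALIDATION": "validation", "TEST": "test"}


def normalize_split(value: str) -> str:
    """Return the lower-case split name, rejecting unknown values."""
    key = value.strip().upper()
    if key not in SPLIT_NAMES:
        raise ValueError(f"unknown split value: {value!r}")
    return SPLIT_NAMES[key]
```

Accepting either case on input and always emitting lower case would keep file and directory names consistent with the split.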

Describe alternatives you've considered

No change

There's a bit of skew when it comes to mapping split val


Here are 863 public repositories matching this topic...

nidhaloff / igel

AxeldeRomblay / MLBox

OpenGene / fastp

dongrixinyu / JioNLP

NVIDIA-Merlin / NVTabular

KinWaiCheuk / nnAudio

TheAlgorithms / R

winedarksea / AutoTS

zzachw / PyHealth

msamogh / nonechucks

MaxHalford / xam

facebookresearch / torcharrow

R1j1t / contextualSpellCheck

cylondata / cylon

ikegami-yukino / neologdn

advaitsave / Introduction-to-Time-Series-forecasting-Python

nlpcl-lab / ace2005-preprocessing

ikegami-yukino / jaconv

qd-cae / awesome-CAE

DataCanvasIO / HyperGBM

Deffro / text-preprocessing-techniques

google / tensorflow-recorder

ropensci / MODIStsp

chakki-works / chariot

lozuwa / impy

mlr-org / mlr3pipelines

autoreject / autoreject

githubharald / DeslantImg

KananVyas / BoxDetection

quqixun / BrainPrep
