The PyTorch implementation, based on YOLOv4, of the paper: Complex-YOLO: Real-time 3D Object Detection on Point Clouds
- Real-time 3D object detection based on YOLOv4
- Distributed Data Parallel Training
- TensorboardX
- CIoU/GIoU loss for optimization (planned; a sketch follows below)
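Since the CIoU/GIoU loss is still planned rather than implemented, here is a minimal reference sketch of GIoU loss for axis-aligned boxes in `(x1, y1, x2, y2)` form. The rotated boxes used by Complex-YOLO would additionally require polygon intersection, which this sketch does not handle; function and argument names are illustrative, not from this repo.

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """GIoU loss for axis-aligned boxes, shape (N, 4) as (x1, y1, x2, y2)."""
    # Intersection rectangle
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)

    # Smallest enclosing box
    cx1 = torch.min(pred[:, 0], target[:, 0])
    cy1 = torch.min(pred[:, 1], target[:, 1])
    cx2 = torch.max(pred[:, 2], target[:, 2])
    cy2 = torch.max(pred[:, 3], target[:, 3])
    c_area = (cx2 - cx1) * (cy2 - cy1)

    giou = iou - (c_area - union) / (c_area + eps)
    return (1.0 - giou).mean()
```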
pip install -U -r requirements.txt
For the mayavi and shapely libraries, please refer to the installation instructions on their official websites.
Download the 3D KITTI detection dataset from here.
The downloaded data includes:
- Velodyne point clouds (29 GB): input data to the Complex-YOLO model
- Training labels of object data set (5 MB): input labels to the Complex-YOLO model
- Camera calibration matrices of object data set (16 MB): for visualization of predictions
- Left color images of object data set (12 GB): for visualization of predictions
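The Velodyne scans are flat binary files of float32 values, four per point (x, y, z, reflectance). A minimal loader (the file path is illustrative):

```python
import numpy as np

# Each KITTI .bin file is a flat float32 array: x, y, z, reflectance per point
points = np.fromfile('dataset/kitti/training/velodyne/000000.bin',
                     dtype=np.float32).reshape(-1, 4)
print(points.shape)  # (num_points, 4)
```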
Please make sure that you arrange the source code and dataset directories as shown below.
For 3D point cloud preprocessing, please refer to the previous works.
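As described in the Complex-YOLO paper, the point cloud is encoded as a three-channel bird's-eye-view RGB-map of height, intensity, and log-normalized density. A rough sketch of that encoding follows; the region-of-interest and resolution constants here are illustrative, not this repo's actual configuration.

```python
import numpy as np

def make_bev_map(points, bev_size=608, x_range=(0, 50), y_range=(-25, 25)):
    """Encode an (N, 4) point cloud as a (3, H, W) BEV map: height, intensity, density."""
    # Keep only points inside the region of interest (x forward, y left, in meters)
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]

    # Discretize metric x/y coordinates to BEV pixel indices
    res_x = (x_range[1] - x_range[0]) / bev_size
    res_y = (y_range[1] - y_range[0]) / bev_size
    xi = np.clip(((pts[:, 0] - x_range[0]) / res_x).astype(np.int32), 0, bev_size - 1)
    yi = np.clip(((pts[:, 1] - y_range[0]) / res_y).astype(np.int32), 0, bev_size - 1)

    bev = np.zeros((3, bev_size, bev_size), dtype=np.float32)
    counts = np.zeros((bev_size, bev_size), dtype=np.float32)
    for x, y, p in zip(xi, yi, pts):
        bev[0, x, y] = max(bev[0, x, y], p[2])  # height: max z per cell (z offset omitted)
        bev[1, x, y] = max(bev[1, x, y], p[3])  # intensity: max reflectance per cell
        counts[x, y] += 1
    # Density channel, normalized as min(1, log(N + 1) / log(64)) as in the paper
    bev[2] = np.minimum(1.0, np.log(counts + 1) / np.log(64))
    return bev
```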
This work is based on YOLOv4 for 2D object detection. Please refer to the original YOLOv4 paper and the PyTorch implementation, which is the great work of Tianxiaomo.
cd src/data_process
python kitti_dataloader.py --batch_size 1 --num_workers 1
python test.py --gpu_idx 0 --pretrained_path <paths>
The trained model will be provided soon. Please watch the repo to get notified of the next update.
python train.py --gpu_idx 0 --multiscale_training
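Conceptually, `--multiscale_training` follows the usual YOLO recipe of switching the input resolution to a random multiple of 32 every few batches. The repo's actual implementation may differ; the class name and constants in this sketch are illustrative.

```python
import random
import torch.nn.functional as F

class MultiscaleResizer:
    """Every `interval` batches, resample inputs to a random multiple of 32."""
    def __init__(self, base_size=608, interval=10):
        self.sizes = list(range(base_size - 3 * 32, base_size + 4 * 32, 32))
        self.interval = interval
        self.size = base_size
        self.step = 0

    def __call__(self, imgs):
        # imgs: (B, C, H, W) batch of BEV maps
        if self.step % self.interval == 0:
            self.size = random.choice(self.sizes)
        self.step += 1
        return F.interpolate(imgs, size=self.size, mode='bilinear', align_corners=False)
```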
We should always use the nccl backend for multi-processing distributed training since it currently provides the best distributed training performance.
- Single machine (node), multiple GPUs
python train.py --dist-url 'tcp://127.0.0.1:29500' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0
- Two machines (two nodes), multiple GPUs
First machine
python train.py --dist-url 'tcp://IP_OF_NODE1:FREEPORT' --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 0
Second machine
python train.py --dist-url 'tcp://IP_OF_NODE2:FREEPORT' --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 1
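Under the hood, `--multiprocessing-distributed` follows the standard PyTorch pattern: spawn one process per GPU and call `init_process_group` with the `nccl` backend. A condensed, self-contained sketch (attribute names and the stand-in model are illustrative):

```python
import argparse

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def main_worker(gpu, ngpus_per_node, args):
    # Global rank = node rank * GPUs per node + local GPU index
    rank = args.node_rank * ngpus_per_node + gpu
    dist.init_process_group(backend='nccl', init_method=args.dist_url,
                            world_size=args.world_size, rank=rank)
    torch.cuda.set_device(gpu)
    model = torch.nn.Linear(10, 2).cuda(gpu)  # stand-in for the real detector
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
    # ... build a DistributedSampler-backed dataloader and train here ...

if __name__ == '__main__':
    ngpus_per_node = torch.cuda.device_count()
    # Single node: world_size is the total number of processes across all nodes
    args = argparse.Namespace(node_rank=0, dist_url='tcp://127.0.0.1:29500',
                              world_size=ngpus_per_node)
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
```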
To reproduce the results, you can run the bash script:
./train.sh
python eval_mAP.py
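For reference, mAP at a fixed IoU threshold is the mean over classes of average precision, each computed from a precision/recall curve. A bare-bones all-point-interpolation AP, assuming detections have already been matched to ground truth (not this repo's exact evaluation code):

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """All-point-interpolated AP from detection confidences and TP/FP flags."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=np.float32)[order]
    fp = 1.0 - tp
    recall = np.cumsum(tp) / max(num_gt, 1)
    precision = np.cumsum(tp) / (np.cumsum(tp) + np.cumsum(fp))

    # Append sentinels, then make precision monotonically decreasing
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]

    # Sum the area under the stepwise precision/recall curve
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

mAP is then the mean of this AP over the Car, Pedestrian, and Cyclist classes at the 0.50 IoU threshold.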
The comparison of this implementation with Complex-YOLOv2 and Complex-YOLOv3 will be updated soon.
mAP Comparison (min 0.50 IoU)
| Model/Class | Car | Pedestrian | Cyclist | Average |
|---|---|---|---|---|
| Complex-YOLO-v2 | | | | |
| Complex-YOLO-v3 | | | | |
| Complex-YOLO-v4 | | | | |
| | Backbone | Detector |
|---|---|---|
| BoF | [x] Dropblock <br> [x] Random rescale, rotation (global) | [x] Cross mini-Batch Normalization <br> [x] Dropblock <br> [x] Random training shapes |
| BoS | [x] Mish activation <br> [x] Cross-stage partial connections (CSP) <br> [x] Multi-input weighted residual connections (MiWRC) | [x] Mish activation <br> [x] SPP-block <br> [x] SAM-block <br> [x] PAN path-aggregation block <br> [ ] CIoU/GIoU loss |
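Mish, which appears in both the backbone and the detector above, is simply `x * tanh(softplus(x))`; a one-module PyTorch version:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x)) (Misra, 2019)."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))
```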
If you think this work is useful, please give me a star!
If you find any errors or have any suggestions, please contact me (Email: nguyenmaudung93.kstn@gmail.com).
Thank you!
@article{Complex-YOLO,
  author = {Martin Simon and Stefan Milz and Karl Amende and Horst-Michael Gross},
  title = {Complex-YOLO: Real-time 3D Object Detection on Point Clouds},
  year = {2018},
  journal = {arXiv},
}

@article{YOLOv4,
  author = {Alexey Bochkovskiy and Chien-Yao Wang and Hong-Yuan Mark Liao},
  title = {YOLOv4: Optimal Speed and Accuracy of Object Detection},
  year = {2020},
  journal = {arXiv},
}
${ROOT}
├── dataset/
│   └── kitti/
│       ├── ImageSets/
│       │   ├── train.txt
│       │   └── val.txt
│       ├── training/
│       │   ├── image_2/ <-- for visualization
│       │   ├── calib/
│       │   ├── label_2/
│       │   └── velodyne/
│       └── testing/
│           ├── image_2/ <-- for visualization
│           ├── calib/
│           └── velodyne/
├── src/
│   ├── config/
│   ├── data_process/
│   ├── models/
│   ├── utils/
│   ├── demo.py
│   ├── eval_mAP.py
│   ├── test.py
│   ├── train.py
│   └── train.sh
├── README.md
└── requirements.txt
usage: train.py [-h] [--seed SEED] [--saved_fn FN] [-a ARCH] [--cfgfile PATH]
[--pretrained_path PATH] [--img_size IMG_SIZE]
[--multiscale_training] [--no-val] [--num_samples NUM_SAMPLES]
[--num_workers NUM_WORKERS] [--batch_size BATCH_SIZE]
[--subdivisions SUBDIVISIONS] [--print_freq N]
[--tensorboard_freq N] [--checkpoint_freq N] [--start_epoch N]
[--num_epochs N] [--lr LR] [--minimum_lr MIN_LR]
[--momentum M] [-wd WD] [--optimizer_type OPTIMIZER]
[--lr_type SCHEDULER] [--burn_in N]
[--steps [STEPS [STEPS ...]]] [--world-size N] [--rank N]
[--dist-url DIST_URL] [--dist-backend DIST_BACKEND]
[--gpu_idx GPU_IDX] [--no_cuda]
[--multiprocessing-distributed] [--evaluate]
[--resume_path PATH]
The Implementation of Complex YOLOv4
optional arguments:
-h, --help show this help message and exit
--seed SEED           random seed for reproducing results
--saved_fn FN         The name used for saving logs, models, ...
-a ARCH, --arch ARCH The name of the model architecture
--cfgfile PATH The path for cfgfile (only for darknet)
--pretrained_path PATH
the path of the pretrained checkpoint
--img_size IMG_SIZE the size of input image
--multiscale_training
If true, use multi-scale input sizes for training
--no-val              If true, don't evaluate the model on the val set
--num_samples NUM_SAMPLES
Take a subset of the dataset to run and debug
--num_workers NUM_WORKERS
Number of threads for loading data
--batch_size BATCH_SIZE
mini-batch size (default: 64); this is the total batch
size of all GPUs on the current node when using Data
Parallel or Distributed Data Parallel
--subdivisions SUBDIVISIONS
subdivisions during training
--print_freq N print frequency (default: 10)
--tensorboard_freq N frequency of saving tensorboard (default: 10)
--checkpoint_freq N frequency of saving checkpoints (default: 3)
--start_epoch N the starting epoch
--num_epochs N number of total epochs to run
--lr LR initial learning rate
--minimum_lr MIN_LR minimum learning rate during training
--momentum M momentum
-wd WD, --weight_decay WD
weight decay (default: 1e-6)
--optimizer_type OPTIMIZER
the type of optimizer, it can be sgd or adam
--lr_type SCHEDULER the type of the learning rate scheduler (steplr or
ReduceonPlateau)
--burn_in N           number of burn-in steps
--steps [STEPS [STEPS ...]]
the steps at which the learning rate is decayed
--world-size N number of nodes for distributed training
--rank N node rank for distributed training
--dist-url DIST_URL url used to set up distributed training
--dist-backend DIST_BACKEND
distributed backend
--gpu_idx GPU_IDX GPU index to use.
--no_cuda If true, cuda is not used.
--multiprocessing-distributed
Use multi-processing distributed training to launch N
processes per node, which has N GPUs. This is the
fastest way to use PyTorch for either single node or
multi node data parallel training
--evaluate            only evaluate the model, do not train
--resume_path PATH the path of the resumed checkpoint
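For example, to resume an interrupted training run from a saved checkpoint, or to run evaluation only (the checkpoint path is a placeholder):

python train.py --gpu_idx 0 --resume_path <path-to-checkpoint>
python train.py --gpu_idx 0 --evaluate --pretrained_path <path-to-checkpoint>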