# distributed-training

Here are 44 public repositories matching this topic...

jankrynauw commented Jun 6, 2019

We would like to forward a particular 'key' column, which is part of the features, so that it appears alongside the predictions. This makes it possible to identify which set of features a particular prediction belongs to. Here is an example of prediction output using tensorflow.contrib.estimator.multi_class_head:

{"classes": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
 "scores": [0.068196
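In TF 1.x, `tf.contrib.estimator.forward_features` was intended for exactly this; outside the Estimator API, the idea reduces to zipping the key column back onto each prediction dict. A minimal framework-free sketch of that pairing (all names and values below are illustrative, not from the issue, and it assumes keys and predictions arrive in the same, unshuffled order):

```python
def attach_keys(keys, predictions, key_name="key"):
    """Pair each prediction dict with the key of the example it came from.

    Assumes `keys` and `predictions` are in the same order, which holds
    when the input pipeline is not shuffled at predict time.
    """
    out = []
    for key, pred in zip(keys, predictions):
        enriched = dict(pred)      # copy so the original prediction is untouched
        enriched[key_name] = key   # forward the identifying column
        out.append(enriched)
    return out

# Hypothetical example: two predictions from a multi-class head.
preds = [
    {"classes": ["0", "1"], "scores": [0.9, 0.1]},
    {"classes": ["0", "1"], "scores": [0.2, 0.8]},
]
result = attach_keys(["row-17", "row-42"], preds)
```

Each element of `result` then carries both the model output and the key identifying its source row.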
aurickq commented Sep 15, 2020

Currently, AdaptDL will change the batch size whenever:

  1. The job is restarted.
  2. A new epoch is started.

This can cause the batch size to fluctuate frequently.

Instead, we should only change the batch size if the new batch size yields a noticeable improvement in the predicted speedup (e.g. by 5% or more). Also consider adding a penalty term when finding the preferred batch size to …
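The proposed rule, switching only when a candidate's predicted speedup beats the current one by a threshold, with a penalty discouraging change, could be sketched as follows. This is a hypothetical illustration, not AdaptDL code: the 5% threshold comes from the comment, while `predicted_speedup`, `change_penalty`, and the candidate set are assumptions.

```python
def choose_batch_size(current_bs, candidates, predicted_speedup,
                      min_gain=0.05, change_penalty=0.02):
    """Keep the current batch size unless a candidate is clearly better.

    predicted_speedup: function mapping a batch size to its predicted speedup.
    min_gain: required relative improvement (5% per the issue) before switching.
    change_penalty: fixed cost subtracted from non-current candidates,
    discouraging fluctuation between near-equivalent batch sizes.
    """
    current = predicted_speedup(current_bs)
    best_bs, best_score = current_bs, current
    for bs in candidates:
        score = predicted_speedup(bs)
        if bs != current_bs:
            score -= change_penalty  # penalty term for changing batch size
        if score > best_score:
            best_bs, best_score = bs, score
    # Only switch if the penalized winner still improves on the current
    # predicted speedup by at least min_gain.
    if best_bs != current_bs and best_score >= current * (1 + min_gain):
        return best_bs
    return current_bs

# Hypothetical speedup predictions for three candidate batch sizes.
speedup = lambda bs: {64: 1.0, 128: 1.03, 256: 1.2}[bs]
```

With these numbers, moving from 64 to 128 is rejected (a 3% gain, minus the penalty, falls below the 5% threshold), while 256 clears it; this keeps the batch size stable across restarts and epoch boundaries unless a change is clearly worthwhile.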
