🐛 Bug
Initializing RPC twice within the same process fails.
To Reproduce
Steps to reproduce the behavior:
- python rpc_bug.py
Actual output:
Traceback (most recent call last):
  File "rpc_bug.py", line 30, in <module>
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
  File "/private/home/tbirch/.conda/envs/torch160-src/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/private/home/tbirch/.conda/envs/torch160-src/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/private/home/tbirch/.conda/envs/torch160-src/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/private/home/tbirch/.conda/envs/torch160-src/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/private/home/tbirch/src/fairscale/rpc_bug.py", line 26, in worker
    init_rpc(rank, world_size, 1)
  File "/private/home/tbirch/src/fairscale/rpc_bug.py", line 15, in init_rpc
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(init_method=init_method),
  File "/private/home/tbirch/.conda/envs/torch160-src/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 86, in init_rpc
    dist_autograd._init(rank)
RuntimeError: Container is already initialized! Cannot initialize it twice!
A different error occurs with a small change:
- Change init_rpc(rank, world_size, 1) to init_rpc(rank, world_size, 0)
- python rpc_bug.py
Actual output:
Traceback (most recent call last):
  File "rpc_bug.py", line 30, in <module>
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/private/home/tbirch/src/fairscale/rpc_bug.py", line 26, in worker
    init_rpc(rank, world_size, 0)
  File "/private/home/tbirch/src/fairscale/rpc_bug.py", line 15, in init_rpc
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(init_method=init_method),
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 78, in init_rpc
    store, _, _ = next(rendezvous_iterator)
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
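For readers unfamiliar with the second error: "Address already in use" is the message the operating system returns when a listener tries to bind a host/port that another listener already holds. The sketch below is not PyTorch code and does not reproduce this issue. It only illustrates that OS-level behavior with plain Python sockets, on the assumption (not confirmed by the traceback) that the second init_rpc call tries to start a store on the same port as the first. The helper name try_listen_twice is hypothetical.

```python
import errno
import socket


def try_listen_twice(host="127.0.0.1"):
    """Bind two listeners to the same host/port and return the errno
    raised by the second bind, or None if it unexpectedly succeeds.

    Hypothetical helper for illustration only; it is unrelated to
    torch.distributed and uses only the standard library.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS for any free port, so the sketch does not
        # depend on a particular port being available.
        first.bind((host, 0))
        first.listen()
        port = first.getsockname()[1]

        # The first listener still holds the port, so this bind is
        # expected to be rejected by the OS.
        second.bind((host, port))
        return None
    except OSError as exc:
        return exc.errno
    finally:
        second.close()
        first.close()


if __name__ == "__main__":
    print(try_listen_twice() == errno.EADDRINUSE)
```

If the assumption holds, giving each initialization its own port, or making sure the first listener is shut down before the second starts, would avoid this particular error, though it would not address the "Container is already initialized" failure above.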
Expected behavior
The process exits cleanly.
Environment
Collecting environment information...
PyTorch version: 1.6.0a0+b31f58d
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.18.0
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.105
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100
Nvidia driver version: 418.116.00
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.6.0a0+d3e3dec
[pip3] torchtext==0.8.0a0+e9b711b
[pip3] torchvision==0.7.0
[conda] magma-cuda101 2.5.2 1 pytorch
[conda] mkl 2020.1 217
[conda] mkl-include 2020.2 256 conda-forge
[conda] numpy 1.19.1 py37h8960a57_0 conda-forge
[conda] torch 1.6.0a0+b31f58d pypi_0 pypi
[conda] torchtext 0.8.0a0+e9b711b pypi_0 pypi
[conda] torchvision 0.7.0 pypi_0 pypi
cc @ezyang @gchanan @zou3519 @bdhirsh @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @jjlilley @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @heitorschueroff @xush6528