cannot call rpc.init_rpc twice within a single process #46491

Open
@froody

Description

🐛 Bug

Initializing RPC twice within the same process fails.

To Reproduce

Steps to reproduce the behavior:

  1. python rpc_bug.py (a sketch of the script is included below)
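
The reproducer script is not attached to the issue, so the following is a best-guess sketch reconstructed from the traceback: each spawned worker runs one rpc.init_rpc / rpc.shutdown cycle and then calls rpc.init_rpc again, with the last argument of a local init_rpc helper selecting a distinct rendezvous port per cycle. The helper signature, port numbers, and worker names are assumptions.

# Hypothetical reconstruction of rpc_bug.py (not attached to the issue); the
# init_rpc helper signature, port numbers, and worker names are assumptions
# inferred from the traceback below.
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp


def init_rpc(rank, world_size, run_id):
    # Use a different rendezvous port per cycle so the two rpc.init_rpc calls
    # do not reuse the same TCPStore address.
    init_method = "tcp://localhost:{}".format(29500 + run_id)
    rpc.init_rpc(
        name="worker{}".format(rank),
        rank=rank,
        world_size=world_size,
        rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(init_method=init_method),
    )


def worker(rank, world_size):
    # First init/shutdown cycle completes.
    init_rpc(rank, world_size, 0)
    rpc.shutdown()
    # Second init in the same process raises
    # "Container is already initialized! Cannot initialize it twice!"
    init_rpc(rank, world_size, 1)
    rpc.shutdown()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2, join=True)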

actual output:

Traceback (most recent call last):
  File "rpc_bug.py", line 30, in <module>
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
  File "/private/home/tbirch/.conda/envs/torch160-src/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/private/home/tbirch/.conda/envs/torch160-src/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/private/home/tbirch/.conda/envs/torch160-src/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/private/home/tbirch/.conda/envs/torch160-src/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/private/home/tbirch/src/fairscale/rpc_bug.py", line 26, in worker
    init_rpc(rank, world_size, 1)
  File "/private/home/tbirch/src/fairscale/rpc_bug.py", line 15, in init_rpc
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(init_method=init_method),
  File "/private/home/tbirch/.conda/envs/torch160-src/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 86, in init_rpc
    dist_autograd._init(rank)
RuntimeError: Container is already initialized! Cannot initialize it twice!

Other error:

  1. Change init_rpc(rank, world_size, 1) to init_rpc(rank, world_size, 0)
  2. python rpc_bug.py

actual output:

Traceback (most recent call last):
  File "rpc_bug.py", line 30, in <module>
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/private/home/tbirch/src/fairscale/rpc_bug.py", line 26, in worker
    init_rpc(rank, world_size, 0)
  File "/private/home/tbirch/src/fairscale/rpc_bug.py", line 15, in init_rpc
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(init_method=init_method),
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 78, in init_rpc
    store, _, _ = next(rendezvous_iterator)
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: Address already in use

Expected behavior

The process exits cleanly; rpc.init_rpc can be called a second time within the same process.
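
For contrast, a single init/shutdown cycle per process does work; it is only the second rpc.init_rpc call within the same process that hits the errors above. A minimal sketch of the working pattern (port and worker names are assumptions):

import torch.distributed.rpc as rpc
import torch.multiprocessing as mp


def worker(rank, world_size):
    # One rpc.init_rpc / rpc.shutdown cycle per process exits cleanly.
    rpc.init_rpc(
        name="worker{}".format(rank),
        rank=rank,
        world_size=world_size,
        rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(
            init_method="tcp://localhost:29500"
        ),
    )
    rpc.shutdown()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2, join=True)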

Environment

Collecting environment information...
PyTorch version: 1.6.0a0+b31f58d
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.18.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.105
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100

Nvidia driver version: 418.116.00
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.6.0a0+d3e3dec
[pip3] torchtext==0.8.0a0+e9b711b
[pip3] torchvision==0.7.0
[conda] magma-cuda101             2.5.2                         1    pytorch
[conda] mkl                       2020.1                      217
[conda] mkl-include               2020.2                      256    conda-forge
[conda] numpy                     1.19.1           py37h8960a57_0    conda-forge
[conda] torch                     1.6.0a0+b31f58d          pypi_0    pypi
[conda] torchtext                 0.8.0a0+e9b711b          pypi_0    pypi
[conda] torchvision               0.7.0                    pypi_0    pypi

cc @ezyang @gchanan @zou3519 @bdhirsh @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @jjlilley @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @heitorschueroff @xush6528

Metadata

Assignees

No one assigned

Labels

    high priority
    module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer)
    oncall: distributed (Add this issue/PR to distributed oncall triage queue)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
