nn.Parameter{List,Dict} not copied to gpus in forward pass when nn.DataParallel is used #36035
Comments
Seems like a correctness issue so I am marking it as high-pri. |
Looks like a bad interaction between ParameterList and the replicate function on nn.Module. This needs more investigation. |
Reducing the priority as many other things do not interact nicely with DataParallel. We can increase it again if we see many users hit this issue. |
Instead of using nn.ParameterList directly, I now pass the nn.Parameters as part of the inputs to the model:

```python
import sys

import torch
import torch.nn as nn
import torch.nn.functional as F

gpus = list(map(int, sys.argv[1].split(',')))

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.ParameterList()
        for i in range(2):
            self.alpha.append(nn.Parameter(1e-3 * torch.randn(i + 2, 5)))
        print('Init: ', [a for a in self.alpha])

    def forward(self, x, alphas):
        # print('Inputs: ', x.shape)
        if alphas is not None:
            alphas = [a.squeeze(0) for a in alphas]
            print('In forward pass: ', [a for a in alphas])
            print([a.shape for a in alphas])
        else:
            print('In forward pass: ', [a.shape for a in self.alpha])
        return x

if __name__ == '__main__':
    net = Net().cuda()
    if len(gpus) > 1:
        net = nn.DataParallel(net, device_ids=gpus)
        alphas = [a.unsqueeze(0).repeat(len(gpus), 1, 1) for a in net.module.alpha]
        print([a.shape for a in alphas])
    else:
        alphas = None
    net(torch.rand(4, 5), alphas)
    # print('Not in forward pass: ', [n for n, p in net.named_parameters()])
```

If I run it this way, this approach can work. |
It doesn't work that way. |
Same problem here using nn.Parameter() with DataParallel |
Same problem here, during the forward pass of a module wrapped with DataParallel, the nn.ParameterList is empty. |
I also encountered the same problem. This problem also applies to nn.ParameterDict. |
Hi, the workaround right now is just to save the Parameter directly on the module and not in the list/dict. Bumping priority as it seems quite common.
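A minimal sketch of that workaround (the module and parameter names here are illustrative, not from the original report):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Register each parameter as a direct attribute instead of collecting
        # them in an nn.ParameterList, so DataParallel's replicate() copies them.
        for i in range(2):
            setattr(self, 'alpha_{}'.format(i), nn.Parameter(1e-3 * torch.randn(i + 2, 5)))

    def alphas(self):
        # Convenience accessor over the individually registered parameters.
        return [getattr(self, 'alpha_{}'.format(i)) for i in range(2)]

    def forward(self, x):
        print('In forward pass: ', [a.shape for a in self.alphas()])
        return x
```
|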
NB: if at all possible, you should use DistributedDataParallel instead!
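For reference, a minimal sketch of the recommended setup with one process per GPU (the model here is a placeholder):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per device, launched e.g. with torch.distributed.launch --use_env.
    dist.init_process_group("nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).cuda()  # placeholder model
    # With a single device per process there is no replicate() call in the
    # forward pass, so the ParameterList/Dict problem does not come up.
    ddp_model = DDP(model, device_ids=[local_rank])

    out = ddp_model(torch.randn(4, 10, device="cuda"))
    out.sum().backward()

if __name__ == "__main__":
    main()
```
|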
@ezyang I'll work on this issue. As far as I understand from the code of pytorch/torch/nn/parallel/replicate.py (line 146 at e440c37), it seems we cannot add parameters with setattr for ParameterList or ParameterDict. Replacing it with

```python
if isinstance(replica, (ParameterList, ParameterDict)):
    replica._parameters[str(key)] = Parameter(param)
else:
    setattr(replica, key, param)
```

propagates the parameters from the list/dict to the replicas. What do you think about such a solution? Otherwise, could you please hint at another way to fix the problem? Thanks! |
Should they actually be Parameters? The comment seems to hint at the fact that they should not. |
@albanD seems like we cannot add non-Parameter objects to a ParameterList or ParameterDict. |
- added test: replicate on Parameter, ParameterList, and ParameterDict
I agree that this would solve the issue, but I wonder if it's not going to cause more issues when other parts of torch.nn interact with it. I don't think we want to "pretend" these are Parameters even though they are Tensors with history. Won't these tensors get picked up by everything that iterates over parameters? One thing I can think of would be to allow ParameterList to contain elements that are not Parameters (like any other nn.Module). These would be added via a special API and won't be registered in the _parameters dict. |
Thanks for pointing this out! Yes, we'd pick these tensors up.
I think the problem would then appear on the user side when they want to access the elements of the ParameterList... |
Ah yes, sorry, this would imply modifying __getitem__ to return the elements from either _parameters or regular attributes (saved in another dict, I guess).
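A rough sketch of that idea; ReplicableParameterList, set_replica_tensor, and _replica_tensors are hypothetical names, not an existing PyTorch API:

```python
import torch
import torch.nn as nn

class ReplicableParameterList(nn.ParameterList):
    # Hypothetical variant that can also hold plain tensors with history.

    def __init__(self, parameters=None):
        super().__init__(parameters)
        self._replica_tensors = {}  # side store, deliberately not in _parameters

    def set_replica_tensor(self, idx, tensor):
        # Special API for replicate(): store a non-Parameter tensor without
        # registering it as a real Parameter.
        self._replica_tensors[str(idx)] = tensor

    def __getitem__(self, idx):
        # Return from the side store first, fall back to the real Parameters.
        # (Only integer indexing is handled in this sketch.)
        key = str(idx)
        if key in self._replica_tensors:
            return self._replica_tensors[key]
        return super().__getitem__(idx)
```

replicate() would then populate the side store via set_replica_tensor instead of calling setattr on these containers. |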
Just checked the DDP single-process multiple-device use case (#36503) and the same issue persists for ParameterList/Dict. Code and output from:

```
python -m torch.distributed.launch --use_env --nproc_per_node=1 ddp_parameters.py
```

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(0.0))
        self.alpha = nn.ParameterList([nn.Parameter(torch.tensor(1.0)), nn.Parameter(torch.tensor(2.0))])
        self.gamma = nn.ParameterDict({"a": nn.Parameter(torch.tensor(3.0)), "b": nn.Parameter(torch.tensor(4.0))})

class NestedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.base = Net()
        self.main_beta = nn.Parameter(torch.tensor(10.0))
        self.main_alpha = nn.ParameterList([nn.Parameter(torch.tensor(11.0)), nn.Parameter(torch.tensor(12.0))])
        self.main_gamma = nn.ParameterDict({"A": nn.Parameter(torch.tensor(13.0)), "B": nn.Parameter(torch.tensor(14.0))})

    def forward(self, x):
        print(dist.get_rank(), "- forward", self.main_beta)
        print(dist.get_rank(), "-- forward", self.main_alpha)
        print(dist.get_rank(), "--- forward", self.base.alpha)

if __name__ == '__main__':
    import os

    dist.init_process_group("gloo", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    net = NestedNet().cuda()
    ddp_net = DDP(net, device_ids=[0, 1])
    if dist.get_rank() == 0:
        print("ddp_net parameters:", list(ddp_net.parameters()))
    ddp_net(torch.rand(2, 4))
```

Output
Users can correctly fetch all parameters with list(ddp_net.parameters()) outside the forward pass, but inside the forward pass the ParameterList/ParameterDict entries are missing. |
This is a serious issue preventing us from upgrading to 1.6. Working around ParameterList (e.g., assigning the parameters directly as attributes) is non-ideal as it breaks loading prior checkpoints. Adding glue code to load prior checkpoints would work, but it's something I don't expect from a stable release. Also related to issue #42327.
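For what it's worth, a sketch of such glue code, assuming a top-level ParameterList formerly named alpha was replaced by attributes alpha_0, alpha_1, ... (all names and paths illustrative):

```python
import torch

def remap_checkpoint_keys(state_dict, list_name='alpha'):
    # Map old nn.ParameterList keys like 'alpha.0' to the new attribute-style
    # names like 'alpha_0'; leave every other key untouched.
    # (Only top-level lists are handled in this sketch.)
    prefix = list_name + '.'
    remapped = {}
    for key, value in state_dict.items():
        if key.startswith(prefix):
            key = list_name + '_' + key[len(prefix):]
        remapped[key] = value
    return remapped

# Usage (checkpoint path and model are the user's own):
# state_dict = torch.load('checkpoint.pth', map_location='cpu')
# model.load_state_dict(remap_checkpoint_keys(state_dict))
```
|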
This seems to be another error instance caused by the changes we made in #33907? If the goal is to access parameters in replicated models, it will no longer work after #33907. A temporary workaround is to access them from the replica's _former_parameters. @ngimel do we have a long-term solution for this? This error seems to emerge quite frequently. |
@mrshenli not really. In the case of ParameterList/ParameterDict, #33907 exposed a deficiency in the ParameterList/ParameterDict design that already made them incompatible with reparameterization; see e.g. the example in the next comment, where ParameterList fails without any data_parallel replication.
Note that while one should not access the parameters this way, the failure shows the design problem is independent of replication. |
@mrshenli I see. As far as I understand, reparameterization works when applied to a regular module:

```python
import torch
import torch.nn as nn

m = nn.Linear(10, 10)
m2 = torch.nn.utils.weight_norm(m, name="weight")
print(type(m2.weight), type(m2.weight_g), type(m2.weight_v))
# <class 'torch.Tensor'> <class 'torch.nn.parameter.Parameter'> <class 'torch.nn.parameter.Parameter'>
```

but it fails for ParameterList:

```python
import torch
import torch.nn as nn

plist = nn.ParameterList([nn.Parameter(torch.randn(10, 10)) for i in range(2)])
plist2 = torch.nn.utils.weight_norm(plist, name="0")
print(["{}:{}".format(n, p.shape) for n, p in plist2.named_parameters()])
# ['1:torch.Size([10, 10])', '0_g:torch.Size([10, 1])', '0_v:torch.Size([10, 10])']
type(plist2[0])
# KeyError: '0'
```

So, naively, a fix would also have to handle how ParameterList looks up its elements. |
Running into the same problem here. Seems like a fairly annoying issue. And the error messages are unhelpful. The data in the ParameterList/Dict just disappears at some point during the forward pass. |
Yeah, this is a serious issue, especially for meta-learning, which updates the network parameters online through the parameter list. |
I'm having the same problem. ParameterList appears empty in the forward pass. Perhaps the workaround could help, but it's far from ideal. |
Encountered the same problem, which slows down training a lot. |
Could the fix for this please be prioritized? It's been half a year... |
@albanD sorry to bother you, do you guys have a timeline for when you think a fix could be integrated? I'm relatively new to ML and torch and this is a big blocker for me (the above proposed workaround isn't an option in my case). |
I also encountered the issue. Snippet to reproduce:

```python
import torch
import torch.nn as nn

class Dummy(nn.Module):
    def __init__(self):
        super().__init__()
        self.params = nn.ParameterList()
        self.params.append(nn.Parameter(torch.Tensor(3, 4)))
        self.params.append(nn.Parameter(torch.Tensor(4, 5)))

    def forward(self, x):
        print(len(self.params), self.params)
        return x
```

Everything works as expected on a single GPU:

```python
>>> device = 'cuda:0'
>>> d = Dummy().to(device)
>>> x = torch.randn(32, 4, 4).to(device)
>>> r = d(x)
2 ParameterList(
    (0): Parameter containing: [torch.cuda.FloatTensor of size 3x4 (GPU 0)]
    (1): Parameter containing: [torch.cuda.FloatTensor of size 4x5 (GPU 0)]
)
```

However, when using DataParallel and more than one device, the ParameterList is empty in the forward pass:

```python
>>> device = 'cuda'
>>> d = nn.DataParallel(d).to(device)
>>> r = d(x)
00 ParameterList()
ParameterList()
```

In addition, I get a warning. |
@Devchonka I am afraid no one is working on a fix for this at the moment. The main reason is that it stems from an issue in the original design of ParameterList/Dict, and it cannot be easily fixed without many side effects on other uses of these wrappers. |
same error |
When I use nn.DataParallel to wrap an nn.Module X, the nn.Parameters in X are not copied to the GPUs in the forward pass. I think an nn.Parameter can be considered part of the module's parameters, so it should be treated like the other nn.Module parameters in X. Is this an intentional design?
To Reproduce
test.py:
When I run

```
python3 test.py 0
```

(which means device_ids = [0]), the output is as expected. However, when I run

```
python3 test.py 0,1
```

(which means device_ids = [0, 1]), only the nn.Module is copied to the GPUs in the forward pass.
How can I use and train nn.Parameter just like nn.Module with nn.DataParallel?
Expected behavior
When the nn.Module X is wrapped with nn.DataParallel, both the nn.Modules and the nn.Parameters in X should be copied to the GPUs.
Environment
PyTorch version: 1.6.0.dev20200401+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Arch Linux
GCC version: (Arch Linux 9.3.0-1) 9.3.0
CMake version: version 3.17.0
Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: 10.2.89
GPU models and configuration:
GPU 0: GeForce GTX TITAN X
GPU 1: GeForce GTX 1060 6GB
Nvidia driver version: 440.64
cuDNN version: /usr/lib/libcudnn.so.7.6.5
Versions of relevant libraries:
[pip3] numpy==1.18.2
[pip3] torch==1.6.0.dev20200401+cu101
[pip3] torchexp==0.1.0
[pip3] torchvision==0.6.0.dev20200401+cu101
[conda] Could not collect
cc @ezyang @gchanan @zou3519 @albanD @mruberry