
nn.Parameter{List,Dict} not copied to gpus in forward pass when nn.DataParallel is used #36035

Open
grtzsohalf opened this issue Apr 5, 2020 · 39 comments · May be fixed by #40589

Comments

grtzsohalf commented Apr 5, 2020

🐛 Bug

When I wrap an nn.Module X with nn.DataParallel, the nn.Parameter entries stored in an nn.ParameterList inside X are not copied to the GPUs in the forward pass. They are part of the module's parameters, so I would expect them to be treated like any other parameter of X. Is this an intentional design?

To Reproduce

test.py:

import sys
import torch
import torch.nn as nn
import torch.nn.functional as F


gpus = list(map(int, sys.argv[1].split(',')))


class Net(nn.Module):
    def __init__(self):
        super().__init__()

        self.alpha = nn.ParameterList()

        for i in range(4):
            self.alpha.append(nn.Parameter(1e-3*torch.randn(i+2, 5)))

        self.cnn = nn.Conv2d(1, 1, 1, 1, 1)


    def forward(self, x):
        print(self.alpha)
        print(self.cnn)
        return x


if __name__ == '__main__':
    net = Net().cuda()
    if len(gpus) > 1:
        net = nn.DataParallel(net, device_ids=gpus)

    net(torch.rand(4, 5))

When I run python3 test.py 0 (i.e. device_ids = [0]), the output is:

ParameterList(
    (0): Parameter containing: [torch.cuda.FloatTensor of size 2x5 (GPU 0)]
    (1): Parameter containing: [torch.cuda.FloatTensor of size 3x5 (GPU 0)]
    (2): Parameter containing: [torch.cuda.FloatTensor of size 4x5 (GPU 0)]
    (3): Parameter containing: [torch.cuda.FloatTensor of size 5x5 (GPU 0)]
)
Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1))

However, when I run python3 test.py 0,1 (i.e. device_ids = [0, 1]), the output is:

ParameterList()
Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1))
ParameterList()
Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1))

Only the nn.Module submodules are copied to the GPUs in the forward pass.
How can I use and train nn.Parameter entries just like nn.Module submodules with nn.DataParallel?

Expected behavior

When the nn.Module X is wrapped with nn.DataParallel, both the submodules and the nn.Parameter entries of X should be copied to the GPUs.

Environment

PyTorch version: 1.6.0.dev20200401+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Arch Linux
GCC version: (Arch Linux 9.3.0-1) 9.3.0
CMake version: version 3.17.0

Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: 10.2.89
GPU models and configuration:
GPU 0: GeForce GTX TITAN X
GPU 1: GeForce GTX 1060 6GB

Nvidia driver version: 440.64
cuDNN version: /usr/lib/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.18.2
[pip3] torch==1.6.0.dev20200401+cu101
[pip3] torchexp==0.1.0
[pip3] torchvision==0.6.0.dev20200401+cu101
[conda] Could not collect

cc @ezyang @gchanan @zou3519 @albanD @mruberry

zou3519 (Contributor) commented Apr 6, 2020

Seems like a correctness issue so I am marking it as high-pri.

albanD (Contributor) commented Apr 6, 2020

Looks like a bad interaction between ParameterList and the replicate function on nn.Module. This needs more investigation.

albanD (Contributor) commented Apr 6, 2020

Reducing the priority as many other things do not interact nicely with DataParallel. We can increase it again if we see many users hit this issue.

grtzsohalf (Author) commented Apr 7, 2020

Thanks for the update, @zou3519 @albanD.
For now, is there a recommended workaround for using nn.Parameter with nn.DataParallel?

zou3519 (Contributor) commented Apr 7, 2020

Instead of using nn.ParameterList, does it work if you directly assign the parameters to the model?

grtzsohalf (Author) commented Apr 7, 2020

For now I pass the nn.Parameter objects directly as part of the inputs to the model.
test.py:

import sys
import torch
import torch.nn as nn
import torch.nn.functional as F


gpus = list(map(int, sys.argv[1].split(',')))


class Net(nn.Module):
    def __init__(self):
        super().__init__()

        self.alpha = nn.ParameterList()
        for i in range(2):
            self.alpha.append(nn.Parameter(1e-3*torch.randn(i+2, 5)))
        print('Init: ', [a for a in self.alpha])


    def forward(self, x, alphas):
        # print('Inputs: ', x.shape)
        if alphas is not None:
            # DataParallel scatters inputs along dim 0, so each replica receives a
            # slice of size 1 per parameter; squeeze that extra dimension back out.
            alphas = [a.squeeze(0) for a in alphas]
            print('In forward pass: ', [a for a in alphas])
            print([a.shape for a in alphas])
        else:
            print('In forward pass: ', [a.shape for a in self.alpha])
        return x


if __name__ == '__main__':
    net = Net().cuda()
    if len(gpus) > 1:
        net = nn.DataParallel(net, device_ids=gpus)
        # Give each parameter a leading dimension of size len(gpus) so that
        # scatter hands one full copy to every replica.
        alphas = [a.unsqueeze(0).repeat(len(gpus), 1, 1) for a in net.module.alpha]
        print([a.shape for a in alphas])
    else:
        alphas = None

    net(torch.rand(4, 5), alphas)
    # print('Not in forward pass: ', [n for n, p in net.named_parameters()])

If I run python3 test.py 0,1, then I get:

Init:  [Parameter containing:
tensor([[-0.0011,  0.0020,  0.0002, -0.0011,  0.0001],
        [ 0.0008,  0.0002,  0.0022,  0.0007, -0.0011]], requires_grad=True), Parameter containing:
tensor([[ 0.0010, -0.0008,  0.0010,  0.0003, -0.0013],
        [-0.0005, -0.0005, -0.0008,  0.0005,  0.0003],
        [ 0.0003,  0.0016,  0.0003, -0.0002,  0.0009]], requires_grad=True)]

[torch.Size([2, 2, 5]), torch.Size([2, 3, 5])]

In forward pass:  In forward pass:  [tensor([[-0.0011,  0.0020,  0.0002, -0.0011,  0.0001],
        [ 0.0008,  0.0002,  0.0022,  0.0007, -0.0011]], device='cuda:0',
       grad_fn=<SqueezeBackward1>), tensor([[ 0.0010, -0.0008,  0.0010,  0.0003, -0.0013],
        [-0.0005, -0.0005, -0.0008,  0.0005,  0.0003],
        [ 0.0003,  0.0016,  0.0003, -0.0002,  0.0009]], device='cuda:0',
       grad_fn=<SqueezeBackward1>)]
[torch.Size([2, 5]), torch.Size([3, 5])]
[tensor([[-0.0011,  0.0020,  0.0002, -0.0011,  0.0001],
        [ 0.0008,  0.0002,  0.0022,  0.0007, -0.0011]], device='cuda:1',
       grad_fn=<SqueezeBackward1>), tensor([[ 0.0010, -0.0008,  0.0010,  0.0003, -0.0013],
        [-0.0005, -0.0005, -0.0008,  0.0005,  0.0003],
        [ 0.0003,  0.0016,  0.0003, -0.0002,  0.0009]], device='cuda:1',
       grad_fn=<SqueezeBackward1>)]
[torch.Size([2, 5]), torch.Size([3, 5])]

This approach can work.

grtzsohalf (Author) commented Apr 7, 2020

Instead of using nn.ParameterList, does it work if you directly assign the parameters to the model?

It doesn't work that way.

lidazhang commented Apr 14, 2020

Same problem here using nn.Parameter() with DataParallel.

cyranawm commented May 29, 2020

Same problem here: during the forward pass of a module wrapped with DataParallel, the nn.ParameterList is empty.

ruoshiliu commented Jun 19, 2020

I ran into the same problem. It also applies to ParameterDict(). Is there any workaround for this issue?

albanD (Contributor) commented Jun 19, 2020

Hi,

The workaround right now is just to save the Parameters directly on the module and not in the list/dict.

Bumping the priority as this seems quite common.
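
To make that workaround concrete, here is a minimal sketch (module and attribute names are illustrative, not from this thread): each parameter is registered directly on the module with setattr and collected by attribute name in forward.

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, num_alphas=4):
        super().__init__()
        self.num_alphas = num_alphas
        # Register each parameter directly on the module instead of in an nn.ParameterList.
        for i in range(num_alphas):
            setattr(self, 'alpha_{}'.format(i), nn.Parameter(1e-3 * torch.randn(i + 2, 5)))

    def alphas(self):
        # Collect by attribute name; on a DataParallel replica this should pick up
        # the copied tensors, since they are attached under the same attribute names.
        return [getattr(self, 'alpha_{}'.format(i)) for i in range(self.num_alphas)]

    def forward(self, x):
        print([a.shape for a in self.alphas()])
        return x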

@ezyang ezyang changed the title nn.Parameter not copied to gpus in forward pass when nn.DataParallel is used nn.Parameter{List,Dict} not copied to gpus in forward pass when nn.DataParallel is used Jun 22, 2020
@ezyang ezyang added small and removed triage review labels Jun 22, 2020
ezyang (Contributor) commented Jun 22, 2020

NB: if at all possible, you should use DistributedDataParallel instead!
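
For completeness, a hedged sketch of the standard one-process-per-GPU DistributedDataParallel setup (the launch command and script name are assumptions); in this mode the module is not re-replicated on each forward, so a ParameterList is expected to behave as it does on a single GPU:

# Assumed launch: python -m torch.distributed.launch --use_env --nproc_per_node=2 ddp_plist.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.ParameterList(
            [nn.Parameter(1e-3 * torch.randn(i + 2, 5)) for i in range(4)])

    def forward(self, x):
        # One process per GPU: no per-forward replication, so the list stays populated.
        print([a.shape for a in self.alpha])
        return x

if __name__ == '__main__':
    dist.init_process_group('nccl', init_method='env://')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    net = Net().cuda()
    ddp_net = DDP(net, device_ids=[local_rank])
    ddp_net(torch.rand(4, 5).cuda())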

vfdev-5 (Collaborator) commented Jun 25, 2020

@ezyang I'll work on this issue.

As far as I understand from the code of the replicate method in nn/parallel/replicate.py, the problem is with this line:

setattr(replica, key, param)

as it seems we cannot add parameters with setattr on a ParameterList or ParameterDict.
Replacing it with

    if isinstance(replica, (ParameterList, ParameterDict)):
        replica._parameters[str(key)] = Parameter(param)
    else:
        setattr(replica, key, param)

propagates the parameters from the list/dict to the replicas. What do you think of this solution? Otherwise, could you suggest another way to fix the problem? Thanks!

albanD (Contributor) commented Jun 25, 2020

Should they actually be Parameters? The comment seems to hint that they should not.

vfdev-5 (Collaborator) commented Jun 25, 2020

@albanD it seems we cannot add a non-Parameter object via replica._parameters[str(key)] = ...

In the replica, the _parameters attribute is set to an empty ordered dict (replica._parameters = OrderedDict()) inside Module._replicate_for_data_parallel.
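
A small standalone illustration of that limitation (a sketch, not taken from the replication code): assigning a plain tensor to a ParameterList only attaches it as a regular attribute, so the list interface never sees it.

import torch
import torch.nn as nn

plist = nn.ParameterList()
# Mimic what replicate() effectively does on a replica: assign a plain tensor, not a Parameter.
# On the affected versions this warns "Setting attributes on ParameterList is not supported."
setattr(plist, '0', torch.randn(2, 2))
print(len(plist))                  # 0 -- __len__ only counts entries in plist._parameters
print(getattr(plist, '0').shape)   # torch.Size([2, 2]) -- the tensor exists only as a plain attribute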

vfdev-5 pushed a commit to Quansight/pytorch that referenced this issue Jun 26, 2020
- added test replicate on parameter, list, dict
albanD (Contributor) commented Jun 26, 2020

I agree that this would solve the issue, but I wonder if it's not going to cause more issues when other parts of torch.nn interact with it. I don't think we want to "pretend" these are Parameters when they are actually Tensors with history. Won't .parameters() pick up these new Tensors as it traverses all the _parameters dicts?

One thing I can think of would be to allow ParameterList to contain elements that are not Parameters (like any other nn.Module). These would be added via a special API and would not be registered in the _parameters dict.

vfdev-5 (Collaborator) commented Jun 26, 2020

Won't .parameters() pick up these new Tensors as it traverses all the _parameters dicts?

Thanks for pointing this out! Yes, we do pick these tensors up.

One thing I can think of would be to allow ParameterList to contain elements that are not Parameters (like any other nn.Module). These would be added via a special API and would not be registered in the _parameters dict.

I think the problem will then appear on the user side, when users try to access the elements of the ParameterList...

albanD (Contributor) commented Jun 26, 2020

I think the problem will then appear on the user side, when users try to access the elements of the ParameterList...

Ah yes, sorry, this would imply modifying __getitem__ to return elements from either _parameters or regular attributes (saved in another dict, I guess).

vfdev-5 (Collaborator) commented Jun 29, 2020

Just checked the DDP single-process, multiple-devices use case (#36503); the same issue persists for ParameterList/Dict.

Code and output

> python -m torch.distributed.launch --use_env --nproc_per_node=1 ddp_parameters.py

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


class Net(nn.Module):
    
    def __init__(self):
        super().__init__()
        
        self.beta = nn.Parameter(torch.tensor(0.0))
        self.alpha = nn.ParameterList([nn.Parameter(torch.tensor(1.0)), nn.Parameter(torch.tensor(2.0))])
        self.gamma = nn.ParameterDict({"a": nn.Parameter(torch.tensor(3.0)), "b": nn.Parameter(torch.tensor(4.0))})


class NestedNet(nn.Module):
    
    def __init__(self):
        super().__init__()
        
        self.base = Net()

        self.main_beta = nn.Parameter(torch.tensor(10.0))
        self.main_alpha = nn.ParameterList([nn.Parameter(torch.tensor(11.0)), nn.Parameter(torch.tensor(12.0))])
        self.main_gamma = nn.ParameterDict({"A": nn.Parameter(torch.tensor(13.0)), "B": nn.Parameter(torch.tensor(14.0))})

    def forward(self, x):
        print(dist.get_rank(), "- forward", self.main_beta)
        print(dist.get_rank(), "-- forward", self.main_alpha)
        print(dist.get_rank(), "--- forward", self.base.alpha)


if __name__ == '__main__':
    import os    
    dist.init_process_group("gloo", init_method="env://")
    
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    net = NestedNet().cuda()
    ddp_net = DDP(net, device_ids=[0, 1])

    if dist.get_rank() == 0:
        print("ddp_net parameters:", list(ddp_net.parameters()))
    
    ddp_net(torch.rand(2, 4))

Output

ddp_net parameters: [Parameter containing:
tensor(10., device='cuda:0', requires_grad=True), Parameter containing:
tensor(0., device='cuda:0', requires_grad=True), Parameter containing:
tensor(1., device='cuda:0', requires_grad=True), Parameter containing:
tensor(2., device='cuda:0', requires_grad=True), Parameter containing:
tensor(3., device='cuda:0', requires_grad=True), Parameter containing:
tensor(4., device='cuda:0', requires_grad=True), Parameter containing:
tensor(11., device='cuda:0', requires_grad=True), Parameter containing:
tensor(12., device='cuda:0', requires_grad=True), Parameter containing:
tensor(13., device='cuda:0', requires_grad=True), Parameter containing:
tensor(14., device='cuda:0', requires_grad=True)]
0 - forward 0 - forward Parameter containing:
tensor(10., device='cuda:0', requires_grad=True)
0 -- forward ParameterList(
    (0): Parameter containing: [torch.cuda.FloatTensor of size  (GPU 0)]
    (1): Parameter containing: [torch.cuda.FloatTensor of size  (GPU 0)]
)
0 --- forward ParameterList(
    (0): Parameter containing: [torch.cuda.FloatTensor of size  (GPU 0)]
    (1): Parameter containing: [torch.cuda.FloatTensor of size  (GPU 0)]
)
tensor(10., device='cuda:1', requires_grad=True)
0 -- forward ParameterList()
0 --- forward ParameterList()

The user can correctly fetch all parameters with .parameters(), but the ParameterList/Dict contents are only present on replica 0.

concretevitamin commented Aug 2, 2020

This is a serious issue preventing us from upgrading to 1.6. Working around ParameterList (e.g., assigning parameters directly as attributes) is non-ideal, as it breaks loading prior checkpoints. Adding glue code to load prior checkpoints would work, but it's not something I expect to need for a stable release.

Also related to issue #42327.
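
If anyone does switch from a ParameterList to direct attributes, a hedged sketch of the kind of glue code mentioned above (the key prefixes are hypothetical and need to match your own checkpoint):

import torch

def remap_parameterlist_keys(state_dict, old_prefix='alpha.', new_prefix='alpha_'):
    # Rename ParameterList-style keys such as 'alpha.0' to direct-attribute
    # names such as 'alpha_0'; everything else is passed through unchanged.
    remapped = {}
    for key, value in state_dict.items():
        if key.startswith(old_prefix) and key[len(old_prefix):].isdigit():
            remapped[new_prefix + key[len(old_prefix):]] = value
        else:
            remapped[key] = value
    return remapped

# Usage (paths and model are placeholders):
# state_dict = torch.load('old_checkpoint.pt', map_location='cpu')
# model.load_state_dict(remap_parameterlist_keys(state_dict))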

mrshenli (Contributor) commented Aug 3, 2020

This seems to be another error instance caused by the changes we made in #33907.

If the goal is to access parameters in replicated models, that no longer works after #33907. A temporary workaround is to access them from _former_parameters, as we do in DDP.

@ngimel do we have a long-term solution for this? This error seems to come up quite frequently.

vfdev-5 (Collaborator) commented Aug 3, 2020

@mrshenli in PR #40589 I proposed a fix for this issue with DP and ParameterList/Dict. Could you please take a look and give feedback? Thanks!

ngimel (Contributor) commented Aug 3, 2020

@mrshenli not really. In the case of ParameterList/ParameterDict, #33907 exposed a deficiency in the ParameterList/ParameterDict design that already made them incompatible with reparameterization; see e.g. the following example, where ParameterList fails without any data_parallel replication:

import torch
import torch.nn as nn

class MyMod(torch.nn.Module):
    def __init__(self):
        super(MyMod, self).__init__()
        self.params = nn.ParameterList([nn.Parameter(torch.randn(10, 10)) for i in range(10)])

    def forward(self, x):
        print( self.params[0].size())
        # ParameterList can act as an iterable, or be indexed using ints
        for i, p in enumerate(self.params):
            x = self.params[i // 2].mm(x) + p.mm(x)
        return x

m = MyMod()
torch.nn.utils.weight_norm(m.params, name="0")
print("applied weight norm")
x = torch.randn(10, 10)
out=m(x)

Note that while one should not access parameters() in the forward pass (and that is the only thing #33907 breaks), and accessing parameters by attribute name still works after #33907, accessing elements of a ParameterList by index is the sole purpose of its existence and the only way to reach those attributes, and yet it does not work after a regular reparameterization. ParameterList/ParameterDict happened to work with data-parallel replication before #33907 because replication hid the fact that it was, in fact, a reparameterization, but the proper fix is to make ParameterList/ParameterDict behave correctly in reparameterization scenarios.

mrshenli (Contributor) commented Aug 3, 2020

Hey @vfdev-5, based on @ngimel's comments above, the fix for ParameterList should be a general one that works both with and without DataParallel?

vfdev-5 (Collaborator) commented Aug 3, 2020

@mrshenli I see. As far as I understand, the weight_norm issue with ParameterList is that we cannot simply add and fetch tensors from those structures. For example, after weight_norm is applied to one of its parameters, a regular module contains one tensor and two parameters:

import torch
import torch.nn as nn

m = nn.Linear(10, 10)
m2 = torch.nn.utils.weight_norm(m, name="weight")
print(type(m2.weight), type(m2.weight_g), type(m2.weight_v))
> <class 'torch.Tensor'> <class 'torch.nn.parameter.Parameter'> <class 'torch.nn.parameter.Parameter'>

but for a ParameterList we lose the original parameter and end up with two additional parameters:

import torch
import torch.nn as nn

plist = nn.ParameterList([nn.Parameter(torch.randn(10, 10)) for i in range(2)])
plist2 = torch.nn.utils.weight_norm(plist, name="0")
print(["{}:{}".format(n, p.shape) for n, p in plist2.named_parameters()])
type(plist2[0])
> ['1:torch.Size([10, 10])', '0_g:torch.Size([10, 1])', '0_v:torch.Size([10, 10])']
> KeyError: '0'

So, naively, the fix for ParameterList would be to allow setattr(parameter_list_module, name, tensor) and to redirect parameter_list_module[name] to getattr(parameter_list_module, name), fetching the tensor from the module itself instead of from parameter_list_module._parameters?
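
A rough sketch of that idea, purely illustrative and not the implementation in #40589: a ParameterList variant whose indexing falls back to plain attribute lookup when the entry is missing from _parameters (negative indices, slices, and __len__/iteration on a replica are ignored here for brevity):

import torch
import torch.nn as nn

class FallbackParameterList(nn.ParameterList):
    def __getitem__(self, idx):
        key = str(idx)
        if key in self._parameters:
            return self._parameters[key]
        # On a DataParallel replica _parameters is empty, but replicate() may have
        # attached the copied tensor as a plain attribute under the same name.
        return getattr(self, key)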

jacobaustin123 commented Sep 8, 2020

Running into the same problem here. Seems like a fairly annoying issue. And the error messages are unhelpful. The data in the ParameterList/Dict just disappears at some point during the forward pass.

XinYuan-believe commented Oct 8, 2020

Yes, this is a serious issue, especially for meta-learning, which updates the network parameters online through the parameter list.

vaxenburg commented Oct 10, 2020

I'm having the same problem. ParameterList appears empty in the forward pass. Perhaps the workaround could help, but it's far from ideal.

ymthink commented Oct 14, 2020

Encountered the same problem, which slows down the training time a lot.

Devchonka commented Dec 14, 2020

Could the fix for this please be prioritized? It's been half a year.

Devchonka commented Jan 29, 2021

@albanD sorry to bother you, but do you have a timeline for when a fix could be integrated? I'm relatively new to ML and torch, and this is a big blocker for me (the workaround proposed above isn't an option in my case).

JeanKossaifi commented Jan 30, 2021

I also encountered the issue.

Snippet to reproduce:

class Dummy(nn.Module):
    def __init__(self):
        super().__init__()
        self.params = nn.ParameterList()
        self.params.append(nn.Parameter(torch.Tensor(3, 4)))
        self.params.append(nn.Parameter(torch.Tensor(4, 5)))
        
    def forward(self, x):
        print(len(self.params), self.params)
        return x

Everything works as expected on a single GPU:

>>> device = 'cuda:0'
>>> d = Dummy().to(device)
>>> x = torch.randn(32, 4, 4).to(device)
>>> r = d(x)
2 ParameterList(
    (0): Parameter containing: [torch.cuda.FloatTensor of size 3x4 (GPU 0)]
    (1): Parameter containing: [torch.cuda.FloatTensor of size 4x5 (GPU 0)]
)

However, when using DataParallel and more than one device, the ParameterList is empty:

>>> device = 'cuda'
>>> d = nn.DataParallel(d).to(device)
>>> r = d(x)
00 ParameterList()
 ParameterList()

In addition, I get the warning:

.../torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
  warnings.warn("Setting attributes on ParameterList is not supported.")

albanD (Contributor) commented Feb 1, 2021

@Devchonka I'm afraid no one is working on a fix for this at the moment. The main reason is that the problem comes from the original design of ParameterList/Dict, and it cannot be easily fixed without many side effects on other uses of these wrappers.
The "simplest" workaround for now is, unfortunately, not to use them and to save all your Parameters directly on the Module.
Alternatively, if you want to keep your code cleaner, you can write a custom nn.Module that holds the parameters for you, but you won't be able to access them as if it were a list.
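
A hedged sketch of that second suggestion (class and method names are mine, not an official API): a small holder module that registers each parameter as a direct attribute and exposes them through a helper method rather than list indexing.

import torch
import torch.nn as nn

class ParamHolder(nn.Module):
    # Holds a fixed number of parameters as direct attributes p0, p1, ...
    def __init__(self, tensors):
        super().__init__()
        self._size = len(tensors)
        for i, t in enumerate(tensors):
            setattr(self, 'p{}'.format(i), nn.Parameter(t))

    def get(self, i):
        # Attribute access rather than a _parameters lookup, so this should also
        # work on DataParallel replicas, where the copies are plain attributes.
        return getattr(self, 'p{}'.format(i))

    def __len__(self):
        return self._size


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = ParamHolder([1e-3 * torch.randn(i + 2, 5) for i in range(4)])

    def forward(self, x):
        print([self.alpha.get(i).shape for i in range(len(self.alpha))])
        return x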

JeanKossaifi commented Feb 8, 2021

@albanD what about #40589?

We use ParameterList in our project, and this bug breaks the library for several users who rely on DataParallel. Do you recommend implementing our own ParameterList as an nn.Module until this is fixed?

xptree added a commit to laekov/fastmoe that referenced this issue Feb 24, 2021
JeanKossaifi referenced this issue in tensorly/torch Mar 5, 2021
zanshuxun commented Mar 6, 2021

Same error here.
