Problem

I trained a model on GPU 1 and saved a checkpoint. However, when I try to load the checkpoint on GPU 0, it fails.

The following is my ckpt loading code:

net = Net()
net.load_state_dict(torch.load(ckpt_path))  # torch.load restores tensors to the device they were saved from

net.cuda(device)  # device = GPU 0

This gives me the following error:

Traceback (most recent call last):
  File "train.py", line 127, in <module>
    net.load_state_dict(torch.load(ckpt_path))
  File "/data/chadrick/venv/tf113/lib/python3.7/site-packages/torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/data/chadrick/venv/tf113/lib/python3.7/site-packages/torch/serialization.py", line 702, in _legacy_load
    result = unpickler.load()
  File "/data/chadrick/venv/tf113/lib/python3.7/site-packages/torch/serialization.py", line 665, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/data/chadrick/venv/tf113/lib/python3.7/site-packages/torch/serialization.py", line 156, in default_restore_location
    result = fn(storage, location)
  File "/data/chadrick/venv/tf113/lib/python3.7/site-packages/torch/serialization.py", line 136, in _cuda_deserialize
    return storage_type(obj.size())
  File "/data/chadrick/venv/tf113/lib/python3.7/site-packages/torch/cuda/__init__.py", line 480, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory

Solution

By default, torch.load deserializes each tensor onto the device it was saved from, which is GPU 1 here. That allocation is what fails with the out-of-memory error, since GPU 1 is occupied. When loading the ckpt data, map the storages to the CPU first, then move the model to the target GPU. The fixed loading code looks like this:

net = Net()
load_data = torch.load(ckpt_path, map_location='cpu')  # deserialize all tensors onto the CPU
net.load_state_dict(load_data)

net.cuda(device)  # then move the model to the target GPU (GPU 0)

The problematic behavior described above is also explained in the official docs.
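As a side note, map_location can also send the storages straight to the target device instead of staging them on the CPU. A minimal sketch, assuming the checkpoint was saved from cuda:1, the target is cuda:0, and Net and ckpt_path are as in the snippets above:

import torch

net = Net()

# Map every storage directly onto the target device,
load_data = torch.load(ckpt_path, map_location='cuda:0')

# or remap only the storages that were saved on cuda:1 to cuda:0:
# load_data = torch.load(ckpt_path, map_location={'cuda:1': 'cuda:0'})

net.load_state_dict(load_data)

Mapping to 'cpu' is the safest default, though, since it works no matter which GPUs are visible on the loading machine.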

