Problem
I have a trained model checkpoint that was saved while training on GPU 1. However, when I try to load the checkpoint on GPU 0, it fails.
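For context, the checkpoint was presumably written with something like the following sketch (Net and ckpt_path stand in for the actual model class and checkpoint path used in the loading code below). The relevant detail is that torch.save records the device each tensor lives on, so a checkpoint written on GPU 1 carries cuda:1 device tags:

import torch

net = Net()
net.cuda(1)  # training ran on GPU 1
# ... training loop ...
# the state dict tensors are pickled together with their cuda:1 location
torch.save(net.state_dict(), ckpt_path)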
The following is my ckpt loading code:
net = Net()
net.load_state_dict(torch.load(ckpt_path))
net.cuda(device)  # device = GPU0
This gives me the following error:
Traceback (most recent call last):
  File "train.py", line 127, in <module>
    net.load_state_dict(torch.load(ckpt_path))
  File "/data/chadrick/venv/tf113/lib/python3.7/site-packages/torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/data/chadrick/venv/tf113/lib/python3.7/site-packages/torch/serialization.py", line 702, in _legacy_load
    result = unpickler.load()
  File "/data/chadrick/venv/tf113/lib/python3.7/site-packages/torch/serialization.py", line 665, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/data/chadrick/venv/tf113/lib/python3.7/site-packages/torch/serialization.py", line 156, in default_restore_location
    result = fn(storage, location)
  File "/data/chadrick/venv/tf113/lib/python3.7/site-packages/torch/serialization.py", line 136, in _cuda_deserialize
    return storage_type(obj.size())
  File "/data/chadrick/venv/tf113/lib/python3.7/site-packages/torch/cuda/__init__.py", line 480, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory
Solution
When loading the checkpoint data, make sure the tensors are mapped from whatever GPU they were saved on to the CPU, and only then move the model to the target GPU. The fixed loading code looks like this:
net = Net()
load_data = torch.load(ckpt_path, map_location='cpu')
net.load_state_dict(load_data)
net.cuda(device)
The problematic behavior described above is also explained in the official docs.
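As a side note (a sketch, not part of the original fix), torch.load also accepts a map_location that points straight at the target device, which avoids the intermediate CPU copy. The 'cuda:0' string below assumes GPU 0 is the target, matching device above:

import torch

net = Net()
# remap tensors that were saved on cuda:1 directly onto cuda:0 during deserialization
load_data = torch.load(ckpt_path, map_location='cuda:0')
net.load_state_dict(load_data)
net.cuda(device)  # device = GPU0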