Avoiding full GPU memory occupation during training in PyTorch
Problem
While training even a small model, I found that GPU memory occupation nearly reached 100%. This seemed odd and made me suspect that my PyTorch training code was not handling GPU memory properly.
Here is pseudocode for my PyTorch training script.
import torch

# Net, get_some_data, loss_fn and steps are placeholders for the real model, data loading, loss and step count.
net = Net().cuda()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for i in range(steps):
    optimizer.zero_grad()
    # load a batch on the CPU, then move it to the GPU
    batch_input_data, batch_gt_data = get_some_data()
    batch_input_tensor = torch.from_numpy(batch_input_data).cuda()
    batch_gt_tensor = torch.from_numpy(batch_gt_data).cuda()
    out = net(batch_input_tensor)
    loss = loss_fn(out, batch_gt_tensor)
    loss.backward()
    optimizer.step()
With this code, not only does GPU memory occupation reach 100%, but in some cases when the batch input becomes large (either a larger batch size, or a single input that is exceptionally larger than the others when individual inputs vary in size), the batch fails to even load onto the GPU and training stops with an out-of-memory exception.
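For reference, a quick way to see how much of that occupation is live tensors versus PyTorch's allocator cache is to query the framework's own counters. Below is a minimal sketch; the helper name and the default device index 0 are my own choices, not part of the original script.
import torch

def report_gpu_memory(tag, device=0):
    # Memory currently used by live tensors.
    allocated = torch.cuda.memory_allocated(device)
    # Memory reserved by PyTorch's caching allocator; nvidia-smi shows roughly
    # this amount plus the CUDA context overhead, which is why it can look far
    # larger than what the tensors themselves need.
    reserved = torch.cuda.memory_reserved(device)
    print(f"[{tag}] allocated: {allocated / 1e6:.1f} MB, reserved: {reserved / 1e6:.1f} MB")

report_gpu_memory("after forward pass")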
Solution
I assumed that tensors from finished steps that had been loaded into GPU memory were not being released properly. Therefore, I manually added a GPU memory release call to the training script, so the pseudocode now looks like this.
net = Net().cuda()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for i in range(steps):
    optimizer.zero_grad()
    batch_input_data, batch_gt_data = get_some_data()
    batch_input_tensor = torch.from_numpy(batch_input_data).cuda()
    batch_gt_tensor = torch.from_numpy(batch_gt_data).cuda()
    # added this line: release cached blocks that are no longer referenced
    torch.cuda.empty_cache()
    out = net(batch_input_tensor)
    loss = loss_fn(out, batch_gt_tensor)
    loss.backward()
    optimizer.step()
The intention is that once the batch_input_tensor and batch_gt_tensor variables have been rebound to fresh data tensors, the old tensors they previously held (and which nothing references anymore) are released forcefully by the added line. I'm not entirely sure my understanding is correct, but it does the job: after making this change, GPU memory occupation throughout training averaged around 20%.
According to the docs, deleting the variables that hold GPU tensors will release GPU memory, but simply deleting them didn't release it instantly: the freed memory stays in PyTorch's caching allocator, so tools like nvidia-smi still count it as occupied. For an instant, visible release, deleting the variables AND calling torch.cuda.empty_cache() was necessary.
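As a small self-contained illustration of this behaviour (a toy tensor, not the training script), the sketch below shows the allocated count dropping as soon as the variable is deleted, while the reserved count only drops after empty_cache():
import torch

x = torch.zeros(1024, 1024, 256, device="cuda")  # roughly 1 GB of float32
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

del x  # freed as far as PyTorch is concerned, but kept in its cache
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

torch.cuda.empty_cache()  # now returned to the driver and visible in nvidia-smi
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())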
In the case above, we are inside a training loop and reusing the batch_input_tensor and batch_gt_tensor variables. Because the variables are reused (rebinding them each step drops the last reference to the previous step's tensors), I didn't delete them manually. But if I wanted to reduce GPU memory usage further, I could update the training code like this.
net = Net().cuda()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for i in range(steps):
    optimizer.zero_grad()
    batch_input_data, batch_gt_data = get_some_data()
    batch_input_tensor = torch.from_numpy(batch_input_data).cuda()
    batch_gt_tensor = torch.from_numpy(batch_gt_data).cuda()
    out = net(batch_input_tensor)
    loss = loss_fn(out, batch_gt_tensor)
    loss.backward()
    # added lines: drop the GPU tensors and return the cached memory to the driver
    del batch_input_tensor
    del batch_gt_tensor
    torch.cuda.empty_cache()
    optimizer.step()
This time I have manually deleted batch_input_tensor and batch_gt_tensor, but this must be done after loss.backward(), since that is the last line that uses the tensors held by the two variables. Deleting them any earlier would not free memory anyway, because the autograd graph keeps its own references to the tensors it needs until the backward pass has run.
The key is to delete the variables after all related tensor operations have finished, and then call torch.cuda.empty_cache() after deleting them.
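To verify the effect without watching nvidia-smi, one option is to log PyTorch's peak-allocation counter inside the loop. A sketch, with the helper name and logging interval being my own additions rather than part of the original script:
import torch

def log_peak_gpu_memory(step, every=100):
    # Print the peak tensor allocation since the last reset, then reset the counter.
    if step % every == 0:
        peak_mb = torch.cuda.max_memory_allocated() / 1e6
        print(f"step {step}: peak GPU memory allocated = {peak_mb:.1f} MB")
        torch.cuda.reset_peak_memory_stats()
Calling this right after optimizer.step() gives a per-interval picture of how much memory each stretch of training steps actually needed.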