pytorch distributed data parallel(DDP) is very useful and relatively well provided for creating a distributed training setup. However, the provided documentations and tutorial are mostly about “training” part and didn’t talk much about validation callbacks that run during training.
It is easy to think just using
DistributedSampler for the validation dataloader would do all the work for you like it did in training dataloader, but it doesn’t. There are two main problems.