## Properly setting dataloader and callback for validation in pytorch DDP

Distributed data parallel (DDP) in pytorch is very useful and relatively well documented for creating a distributed training setup. However, the provided documentation and tutorials mostly cover the “training” part and say little about validation callbacks that run during training.

It is easy to assume that using DistributedSampler for the validation dataloader would do all the work for you, as it does for the training dataloader, but it doesn’t. There are two main problems.
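One behavior that is easy to check without launching multiple processes is how `DistributedSampler` shards a dataset whose size is not divisible by the number of processes. The sketch below is only an illustration and may or may not correspond to the two problems the post goes on to describe. `shard_indices` is a hypothetical helper, and `num_replicas` and `rank` are passed explicitly so that no process group is needed.

```python
from torch.utils.data.distributed import DistributedSampler


def shard_indices(dataset_len, world_size):
    """Return the sample indices each rank would see (hypothetical helper)."""
    dataset = list(range(dataset_len))  # the sampler only needs len(dataset)
    shards = []
    for rank in range(world_size):
        # Passing num_replicas and rank explicitly avoids needing
        # torch.distributed to be initialized for this illustration.
        sampler = DistributedSampler(
            dataset, num_replicas=world_size, rank=rank, shuffle=False
        )
        shards.append(list(sampler))
    return shards


shards = shard_indices(dataset_len=10, world_size=4)
seen = [i for shard in shards for i in shard]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
print("indices seen in total:", len(seen), "unique:", len(set(seen)))
```

With the default settings the sampler pads the index list so that every rank gets the same number of samples, which means some validation samples are visited twice, and each rank only ever sees its own shard. A validation metric therefore likely needs to be aggregated across ranks, with the duplicates kept in mind.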

## pytorch implementation of sinusoidal position encoding

There are existing sinusoidal position encoding modules out there, but the ones that I came across mostly assume that the position increments from 0 to the length of the sequence. For example, when a token embedding sequence of shape (B, L, D_token) is given, the sinusoidal position encoding module takes this tensor as input, internally creates a (B, L) tensor where each row is (0, 1, 2, 3, …, L-1), and then applies sinusoidal encoding to it.
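As a contrast to the modules described above, here is a minimal sketch of an encoder that takes the position tensor itself as input instead of generating 0 to L-1 internally. It assumes this is the direction the post takes; the class name `SinusoidalPositionEncoding`, the `max_period` parameter, and the interleaved sin/cos layout are choices made for this illustration, not the author's code.

```python
import math

import torch
import torch.nn as nn


class SinusoidalPositionEncoding(nn.Module):
    """Encode arbitrary positions of shape (B, L) into (B, L, dim)."""

    def __init__(self, dim, max_period=10000.0):
        super().__init__()
        if dim % 2 != 0:
            raise ValueError("dim must be even")
        # One frequency per (sin, cos) pair: 1 / max_period^(2i / dim).
        exponent = torch.arange(0, dim, 2).float() / dim
        inv_freq = torch.exp(-math.log(max_period) * exponent)
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, positions):
        # positions: (B, L), any numeric dtype, not necessarily 0..L-1.
        angles = positions.unsqueeze(-1).float() * self.inv_freq  # (B, L, dim/2)
        # Interleave sin and cos along the last axis: (B, L, dim/2, 2) -> (B, L, dim).
        return torch.stack((angles.sin(), angles.cos()), dim=-1).flatten(-2)


encoder = SinusoidalPositionEncoding(dim=8)
positions = torch.tensor([[0, 1, 2], [5, 9, 100]])  # arbitrary positions per row
encoding = encoder(positions)
print(encoding.shape)
```

Because the positions are an explicit input, the same module can be applied to the output of `torch.arange(L)` to recover the usual 0 to L-1 behavior.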

## cross entropy loss / focal loss implementation in pytorch

At the moment, the code is written for torch 1.4. Binary cross entropy loss: torch 1.6 is currently out, and according to the pytorch docs, the torch.max function can receive two tensors and return element-wise max values. However, in 1.4 this feature is not yet supported and that is…
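One way to avoid depending on the two-tensor form of `torch.max` is to build the element-wise maximum from `torch.where`. The sketch below uses it in a numerically stable binary cross entropy on logits. This is an assumed formulation for illustration, not the post's actual code, and `elementwise_max` and `bce_with_logits` are hypothetical names.

```python
import torch
import torch.nn.functional as F


def elementwise_max(a, b):
    """Element-wise maximum without the two-tensor form of torch.max."""
    return torch.where(a > b, a, b)


def bce_with_logits(logits, targets):
    """Stable BCE on logits: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    zeros = torch.zeros_like(logits)
    loss = (
        elementwise_max(logits, zeros)
        - logits * targets
        + torch.log1p(torch.exp(-logits.abs()))
    )
    return loss.mean()


logits = torch.tensor([-3.0, -0.5, 0.0, 2.0, 8.0])
targets = torch.tensor([0.0, 1.0, 1.0, 0.0, 1.0])
ours = bce_with_logits(logits, targets)
reference = F.binary_cross_entropy_with_logits(logits, targets)
print(ours.item(), reference.item())
```

Comparing against `F.binary_cross_entropy_with_logits` is a quick sanity check that the hand-written loss agrees with the built-in one.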

## avoiding full gpu memory occupation during training in pytorch

Problem: While training even a small model, I found that the GPU memory occupation nearly reached 100%. This seemed odd, and it made me suspect that my pytorch training code was not handling GPU memory properly. Here is pseudo code for my pytorch training script. With this…
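When diagnosing this kind of symptom, it can help to separate memory held by live tensors from memory that pytorch's caching allocator has reserved but is not currently using, since tools such as `nvidia-smi` report the latter as occupied. The sketch below is a generic diagnostic and does not assume anything about the cause in the post. `gpu_memory_report` is a hypothetical helper, and it returns zeros when no CUDA device is available.

```python
import torch


def gpu_memory_report(device=0):
    """Return allocated vs. reserved GPU memory in MiB (hypothetical helper)."""
    if not torch.cuda.is_available():
        # No GPU on this machine: nothing to report.
        return {"allocated_mb": 0.0, "reserved_mb": 0.0}
    mib = 1024 ** 2
    return {
        # Memory occupied by tensors that are currently alive.
        "allocated_mb": torch.cuda.memory_allocated(device) / mib,
        # Memory held by the caching allocator, including freed blocks.
        "reserved_mb": torch.cuda.memory_reserved(device) / mib,
    }


report = gpu_memory_report()
print(report)
```

Calling this at a few points in the training loop (for example after the forward pass, after the backward pass, and after validation) shows where usage grows.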

## applying xavier normal initialization to conv/linear layer(module) in pytorch

In tensorflow, layers are initialized by default with Glorot initialization, which is also known as Xavier initialization (note that Keras layers default to the uniform variant, `glorot_uniform`, rather than the normal one). To apply Xavier normal initialization in pytorch, the following practice should be done. Examples are given for a 2d convolution module and a linear module. pytorch supports other initialization functions, and one can use those initialization functions in…
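A minimal sketch of the practice described above, using `nn.init.xavier_normal_` on the weights of conv and linear modules. The helper name `init_xavier_normal`, the toy model, and the choice to zero the biases are assumptions made for this illustration rather than the post's own code.

```python
import torch
import torch.nn as nn


def init_xavier_normal(module):
    """Apply Xavier (Glorot) normal init to conv/linear weights."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_normal_(module.weight)
        if module.bias is not None:
            # Assumption for this sketch: start biases at zero.
            nn.init.zeros_(module.bias)


# Toy model: 3x8x8 input -> conv (8 channels, 6x6 output) -> linear.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 6 * 6, 10),
)
model.apply(init_xavier_normal)  # apply() visits every submodule recursively

output = model(torch.randn(2, 3, 8, 8))
print(output.shape)
```

Swapping `nn.init.xavier_normal_` for another function from `torch.nn.init`, such as `nn.init.xavier_uniform_` or `nn.init.kaiming_normal_`, follows the same pattern.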