paper review: "High-Performance Large-Scale Image Recognition Without Normalization"
arxiv: https://arxiv.org/pdf/2102.06171v1.pdf
key points
- introduces NFNets, which combine multiple existing ideas to avoid batch normalization while achieving on-par performance
- beyond just stacking normalizer-free techniques, the paper introduces adaptive gradient clipping (AGC), which is what makes these networks train well enough to match BN-based results
benefits and drawbacks of batch normalization
- the good
- smooths the loss landscape, enabling larger learning rates and larger batch sizes
- regularizing effect
- the bad
- expensive computation
- discrepancy between train-time and test-time behavior
- breaks independence between training examples in a minibatch
- leaks information across examples, which causes problems in sequence modeling tasks and contrastive learning algorithms
- performance degrades when the batch statistics have large variance during training
- BN is sensitive to batch size and performs poorly when the batch size is too small
NFNets
- builds on NF-ResNets (normalizer-free ResNets)
- doesn’t use normalization layers
- employs a modified residual block, h_{i+1} = h_i + α·f_i(h_i / β_i), to keep the activation variance under control
- also uses Scaled Weight Standardization to prevent mean-shift in the hidden activations (see the sketch after this list)
- employs dropout and stochastic depth for regularization
- with these pieces, NF-ResNets outperform BN-trained ResNets at small batch sizes, but not at large batch sizes
- they also do not outperform EfficientNets
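A minimal sketch in PyTorch of the two ingredients above (class names are mine; the nonlinearity-specific gain, bottleneck structure, and stochastic depth from the paper are omitted): a Scaled Weight Standardization conv and a normalizer-free residual block of the form h_{i+1} = h_i + α·f(h_i / β):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledWSConv2d(nn.Conv2d):
    """Conv2d with Scaled Weight Standardization: weights are standardized
    per output channel and rescaled by a learnable gain, which suppresses
    the mean-shift that batch norm would otherwise remove."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.gain = nn.Parameter(torch.ones(self.out_channels, 1, 1, 1))

    def forward(self, x):
        fan_in = self.weight[0].numel()  # in_channels * kh * kw per output unit
        mean = self.weight.mean(dim=(1, 2, 3), keepdim=True)
        var = self.weight.var(dim=(1, 2, 3), keepdim=True)
        # standardize and scale by 1/sqrt(fan_in) so activations keep unit variance
        weight = self.gain * (self.weight - mean) * torch.rsqrt(var * fan_in + 1e-4)
        return F.conv2d(x, weight, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

class NFBlock(nn.Module):
    """Normalizer-free residual block: out = x + alpha * f(x / beta)."""
    def __init__(self, channels, alpha=0.2, beta=1.0):
        super().__init__()
        self.alpha, self.beta = alpha, beta
        self.conv1 = ScaledWSConv2d(channels, channels, 3, padding=1)
        self.conv2 = ScaledWSConv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = F.relu(x / self.beta)   # downscale the residual branch's input
        out = F.relu(self.conv1(out))
        out = self.conv2(out)
        return x + self.alpha * out   # alpha controls variance growth per block
```

In the paper α is small (≈0.2), which limits how fast the activation variance grows from block to block, and β_i is set to the expected standard deviation of the block's input so the residual branch sees roughly unit-variance activations.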
Adaptive gradient clipping (AGC)
- to scale NF-ResNets to larger batch sizes, the paper applies adaptive gradient clipping
- the AGC policy (see the code sketch at the end of this list):
- the ratio of the gradient norm ||G|| to the weight norm ||W|| gives a simple measure of how much a single gradient-descent step will change the weights
- AGC clips gradients unit-wise (e.g. per output channel): G_i ← λ · (||W_i|| / ||G_i||) · G_i whenever ||G_i|| / ||W_i|| > λ, with ||W_i|| clamped to a small minimum; this works better empirically than clipping whole layers
- still, the clipping threshold (λ) is a hyper-parameter.
- AGC can be thought of as a relaxed version of normalized optimizers.
- using AGC, larger batch sizes can be used in training.
- the benefit of AGC is smaller at smaller batch sizes
- it is good practice not to apply AGC to the final (classifier) layer
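A minimal sketch of unit-wise AGC in PyTorch (function names and the eps defaults are my assumptions), applied to the gradients after backward() and before the optimizer step:

```python
import torch

def unitwise_norm(x):
    # Frobenius norm per "unit": per output channel for conv/linear weights
    # (reduced over all remaining dims); whole-tensor norm for biases/gains.
    if x.ndim <= 1:
        return x.norm(2)
    return x.norm(2, dim=tuple(range(1, x.ndim)), keepdim=True)

def adaptive_grad_clip(parameters, lam=0.01, eps=1e-3):
    """Rescale each unit's gradient so that ||G_i|| / max(||W_i||, eps) <= lam."""
    for p in parameters:
        if p.grad is None:
            continue
        w, g = p.detach(), p.grad.detach()
        max_norm = unitwise_norm(w).clamp_(min=eps) * lam
        g_norm = unitwise_norm(g)
        clipped = g * (max_norm / g_norm.clamp(min=1e-6))
        p.grad.copy_(torch.where(g_norm > max_norm, clipped, g))
```

Usage would look roughly like loss.backward(); adaptive_grad_clip(model.parameters(), lam=0.01); optimizer.step(), skipping the final classifier layer's parameters per the note above.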
ablation study
- out of 4 levels, the third level seems to be the best place to increase capacity.
- depth pattern = the number of blocks in each level / width pattern = the number of channels in each level
- scaling the dropout drop rate with model size was good practice; this matters because NFNets do not get the implicit regularization effect that batch norm provides
Comments
- EfficientNet: uses inverted bottleneck (MBConv) blocks?
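For reference, a simplified sketch of an inverted bottleneck (MBConv-style) block as used in EfficientNet: expand channels with a 1x1 conv, run a depthwise 3x3 conv in the wide space, then project back down with a 1x1 conv. Squeeze-and-excite and batch norm are omitted here, and the class name is mine.

```python
import torch
import torch.nn as nn

class InvertedBottleneck(nn.Module):
    """Simplified MBConv block: 1x1 expand -> 3x3 depthwise -> 1x1 project,
    with a residual connection (stride 1, same channel count assumed)."""
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Conv2d(channels, hidden, 1, bias=False)
        self.depthwise = nn.Conv2d(hidden, hidden, 3, padding=1,
                                   groups=hidden, bias=False)
        self.project = nn.Conv2d(hidden, channels, 1, bias=False)
        self.act = nn.SiLU()

    def forward(self, x):
        out = self.act(self.expand(x))
        out = self.act(self.depthwise(out))
        out = self.project(out)   # no activation after the projection
        return x + out            # residual connection
```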