paper review: "High-Performance Large-Scale Image Recognition Without Normalization"

arxiv: https://arxiv.org/pdf/2102.06171v1.pdf

key points

  • introduces NFNets, which combine several existing ideas for training without batch norm while still reaching on-par performance
  • beyond just stacking a bunch of non-BN techniques, the paper introduces adaptive gradient clipping (AGC), which is what actually makes these networks train well enough to match BN-based results

the good and bad of batch normalization

  • good
    • smooths the loss landscape, enabling larger learning rates and larger batch sizes
    • regularizing effect
  • the bad
    • expensive computation
    • discrepancy in behavior between training and inference
    • breaks independence between training examples in a minibatch
    • leaks information across examples in the batch, which causes problems in sequential modeling tasks and contrastive learning algorithms
    • performance degrades when the batch statistics have large variance during training
    • BN is sensitive to batch size, and performs poorly when the batch size is too small

NFNets

  • builds on NF-ResNets (normalizer-free ResNets)
  • doesn’t use any normalization layers
  • employs a modified residual block (see the sketch after this list)
  • also uses scaled weight standardization to prevent a mean shift in the hidden activations
  • employs dropout and stochastic depth for regularization
  • this way NF-ResNets outperform BN-trained ResNets at small batch sizes, but not at large batch sizes
  • and they do not outperform EfficientNets
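
To make the modified residual block and scaled weight standardization concrete, here is a rough PyTorch sketch of how I understand them. This is not the paper's code: the class names, the eps constant, and the two-conv block body are my own simplifications, and details such as the nonlinearity-dependent gain, squeeze-excite blocks, and stochastic depth are left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledWSConv2d(nn.Conv2d):
    """Conv2d with scaled weight standardization (sketch).

    Each output channel's weights are standardized over their fan-in and
    scaled by 1/sqrt(fan_in), so activations keep roughly unit variance
    at init without any normalization layer.
    """
    def __init__(self, *args, eps: float = 1e-4, **kwargs):
        super().__init__(*args, **kwargs)
        # learnable per-output-channel gain
        self.gain = nn.Parameter(torch.ones(self.out_channels, 1, 1, 1))
        self.eps = eps

    def standardized_weight(self) -> torch.Tensor:
        w = self.weight
        fan_in = w[0].numel()                       # in_channels/groups * kH * kW
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        var = w.var(dim=(1, 2, 3), keepdim=True)
        w = (w - mean) * torch.rsqrt(var * fan_in + self.eps)
        return w * self.gain

    def forward(self, x):
        return F.conv2d(x, self.standardized_weight(), self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

class NFBlock(nn.Module):
    """Simplified normalizer-free residual block: out = h + alpha * f(h / beta)."""
    def __init__(self, channels: int, alpha: float = 0.2, beta: float = 1.0):
        super().__init__()
        self.alpha, self.beta = alpha, beta
        self.conv1 = ScaledWSConv2d(channels, channels, 3, padding=1)
        self.conv2 = ScaledWSConv2d(channels, channels, 3, padding=1)

    def forward(self, h):
        x = h / self.beta                           # rescale input to ~unit variance
        x = self.conv2(F.relu(self.conv1(F.relu(x))))
        return h + self.alpha * x                   # alpha controls variance growth
```

Here β rescales the residual branch input back to roughly unit variance and α (0.2 in the paper) controls how fast the activation variance grows across blocks; together with the standardized weights, this takes over the job batch norm was doing.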

Adaptive gradient clipping (AGC)

  • to scale NF-ResNets to larger batch sizes, the paper uses adaptive gradient clipping
  • the idea behind the AGC policy:

    the ratio of the norm of the gradient (G) to the norm of the weights (W) provides a simple measure of how much a single gradient descent step will change the original weights. AGC clips the gradient of each unit whenever this ratio exceeds a threshold λ, rescaling it to λ · ‖W‖ / ‖G‖ · G.
  • gradients are clipped based on unit-wise ratios of gradient norm over weight norm (e.g. per output channel of a conv filter) rather than a single layer-wise ratio; this works better empirically (sketched in code after this list)
  • still, the clipping threshold (λ) is a hyper-parameter that has to be tuned
  • AGC can be thought of as a relaxed version of normalized optimizers such as LARS
  • using AGC, larger batch sizes can be used in training
  • the benefit of AGC is smaller at small batch sizes
  • not applying AGC to the last (classifier) layer is good practice
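
A minimal PyTorch sketch of unit-wise AGC, as I understand it from the paper. The function names and the λ = 0.01 default are mine (the best λ depends on batch size); the small eps on the weight norm follows the paper's trick for keeping zero-initialized weights from having their gradients clipped to zero.

```python
import torch

def unitwise_norm(x: torch.Tensor) -> torch.Tensor:
    """Frobenius norm per unit (per output row/channel), shaped for broadcasting."""
    if x.ndim <= 1:                          # biases and gains: each scalar is its own unit
        return x.abs()
    reduce_dims = tuple(range(1, x.ndim))    # reduce over the fan-in dimensions
    return x.norm(p=2, dim=reduce_dims, keepdim=True)

def adaptive_grad_clip_(parameters, clip_lambda: float = 0.01, eps: float = 1e-3):
    """In-place unit-wise adaptive gradient clipping (sketch).

    For each unit i:
        if ||G_i|| / max(||W_i||, eps) > lambda:
            G_i <- lambda * max(||W_i||, eps) / ||G_i|| * G_i
    """
    for p in parameters:
        if p.grad is None:
            continue
        w_norm = unitwise_norm(p.detach()).clamp(min=eps)
        g_norm = unitwise_norm(p.grad.detach())
        max_norm = clip_lambda * w_norm
        # rescale only the units whose gradient norm exceeds lambda * weight norm
        scale = max_norm / g_norm.clamp(min=1e-6)
        clipped = torch.where(g_norm > max_norm, p.grad * scale, p.grad)
        p.grad.detach().copy_(clipped)
```

In a training loop this would be called right after `loss.backward()` and before `optimizer.step()`, on every parameter except those of the final classifier layer, which (as noted above) is better left unclipped.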

ablation study

  • out of the 4 levels (residual stages), the third level seems to be the best place to add capacity
  • depth pattern = the number of blocks in each level / width pattern = the channel width of each level
  • scaling the dropout drop rate along with model size was good practice. This seems to be important since NFNets don’t get the implicit regularization effect that batch norm provided

Comments

  • efficientnet: inverted bottleneck block?