Summary

The paper introduces Layer Normalization (LN), a normalization method that standardizes each input example across all of its features, rather than standardizing each feature across all examples in a batch as batch normalization (BN) does. Unlike BN, there is no intra-batch dependency, no need to keep running means and variances, and the computation is identical at training and inference time (so it works for online inference). LN is also straightforward to apply to RNNs, whereas BN needs separate statistics per time step and layer.
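
As a minimal NumPy sketch of the distinction (the shapes and epsilon below are illustrative assumptions, not values from the paper), BN and LN differ only in the axis over which the statistics are taken:

```python
import numpy as np

x = np.random.randn(32, 64)  # toy batch: 32 examples, 64 features
eps = 1e-5

# Batch normalization: statistics per feature, computed across the batch (axis 0)
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Layer normalization: statistics per example, computed across its features (axis 1)
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)
```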

Deep dive

What is the goal of the paper?

  • Why does it matter?
  • What is the significance/impact of the conclusion?

What is the approach?

  • Is the approach well-motivated given existing literature?

Batch Normalization

MLP case: Assume an input batch $X \in \mathbb{R}^{B \times D}$, with $B$ being the batch size and $D$ being the number of features. Then, for each feature $j \in \{1, \dots, D\}$:

$$\mu_j = \frac{1}{B} \sum_{i=1}^{B} x_{ij}, \qquad \sigma_j^2 = \frac{1}{B} \sum_{i=1}^{B} (x_{ij} - \mu_j)^2, \qquad \hat{x}_{ij} = \frac{x_{ij} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}, \qquad y_{ij} = \gamma_j \hat{x}_{ij} + \beta_j$$
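
A hedged NumPy sketch of the step above, including the running statistics BN must maintain for inference; the function name, the momentum value, and the update rule are illustrative assumptions, not the paper's code:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, running_mean, running_var, momentum=0.1, eps=1e-5):
    """One training-mode BN step for activations x of shape (B, D)."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize each feature
    y = gamma * x_hat + beta               # learned per-feature scale and shift
    # Running statistics are what inference uses in place of batch statistics.
    running_mean = (1 - momentum) * running_mean + momentum * mu
    running_var = (1 - momentum) * running_var + momentum * var
    return y, running_mean, running_var
```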

Extra Notes

  • Highly dependent on the batch size
  • Hard to apply to RNNs
  • Needs to keep running statistics (mean and variance, updated with a momentum term) for inference

Layer Normalization

Standardizes each input example across all of its features.

  • Does not depend on the batch size
  • Can be easily applied to RNNs (see the sketch after this list)
  • Behaves identically at training and inference time
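
A sketch of LN inside a toy recurrent step, following the paper's recurrent formulation (normalize the summed inputs, then apply the learned gain and bias before the nonlinearity); the toy weights, shapes, and epsilon are assumptions for illustration:

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    """Standardize one example's summed inputs a (shape (H,)) across its features."""
    mu = a.mean()     # single mean over the H features of this example
    sigma = a.std()   # single std over the H features of this example
    return gain * (a - mu) / (sigma + eps) + bias

# Toy recurrent loop: the same gain/bias are shared across all time steps,
# so no per-time-step statistics are needed (unlike BN in an RNN).
H = 64
gain, bias = np.ones(H), np.zeros(H)
W_h = np.random.randn(H, H) * 0.01
W_x = np.random.randn(H, H) * 0.01
h = np.zeros(H)
for x_t in np.random.randn(10, H):            # a 10-step toy sequence
    a_t = W_h @ h + W_x @ x_t                 # summed inputs at step t
    h = np.tanh(layer_norm(a_t, gain, bias))  # LN applied before the nonlinearity
```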

Results

  • Are they correct?
  • Are they rigorous?
  • What else could be done?
  • Datasets + experiments
  • Evaluation metrics

Next steps

  • Where do we go from here?