Summary
The paper introduces Layer Normalization (LN), a normalization method that standardizes each input example across all of its features, instead of standardizing each feature across all examples in a batch as Batch Normalization (BN) does. Unlike BN, LN has no intra-batch dependency, no running means and variances to track, and behaves identically at training and inference time (so it works for online inference with batch size 1). It is also straightforward to apply to RNNs, whereas BN requires separate statistics per time step and layer.
Deep dive
What is the goal of the paper?
- Why does it matter?
- What is the significance/impact of the conclusion?
What is the approach?
- Is the approach well-motivated given existing literature?
Batch Normalization
MLP case: Assume an input $X \in \mathbb{R}^{B \times D}$, with $B$ being the batch size and $D$ being the number of features. Then, for each feature $j \in \{1, \dots, D\}$:

$$\mu_j = \frac{1}{B} \sum_{i=1}^{B} x_{ij}, \qquad \sigma_j^2 = \frac{1}{B} \sum_{i=1}^{B} (x_{ij} - \mu_j)^2, \qquad \hat{x}_{ij} = \frac{x_{ij} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}, \qquad y_{ij} = \gamma_j \hat{x}_{ij} + \beta_j$$
Extra Notes
- Highly dependent on batch size
- Hard to apply to RNNs
- Needs to keep track of running statistics for inference (typically updated with momentum during training)
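As a small sketch (not the paper's code), training-mode batch normalization over a `(B, D)` activation matrix can be written in NumPy; the names `gamma`, `beta`, and `eps` are my own choices:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) across the batch dimension.

    x: array of shape (B, D); gamma, beta: arrays of shape (D,).
    """
    mu = x.mean(axis=0)                    # per-feature mean, shape (D,)
    var = x.var(axis=0)                    # per-feature variance, shape (D,)
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize each column
    return gamma * x_hat + beta            # learned scale and shift
```

Note that `mu` and `var` depend on the whole batch, which is exactly the intra-batch dependency listed above.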
Layer Normalization
Standardizes each input example across all of its features.
- Does not depend on batch size
- Can be easily applied to RNNs
- Does not change during inference
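For contrast, a sketch of layer normalization in NumPy (again with assumed names `gamma`, `beta`, `eps`): the only change from the batch-norm formula is that statistics are taken over the feature axis of each example rather than over the batch axis:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each example (row) across its features.

    x: array of shape (B, D); gamma, beta: arrays of shape (D,).
    """
    mu = x.mean(axis=1, keepdims=True)     # per-example mean, shape (B, 1)
    var = x.var(axis=1, keepdims=True)     # per-example variance, shape (B, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize each row
    return gamma * x_hat + beta            # learned scale and shift
```

Because each row is normalized using only its own statistics, the result is independent of batch size and the same function is used at training and inference time.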
Results
- Are they correct?
- Are they rigorous?
- What else could be done?
- Datasets + experiments
- Evaluation metrics
Next steps
- Where do we go from here?