Matthew D. Zeiler, Rob Fergus (2013)
- Novel visualization technique to understand representations learned by intermediate layers of a CNN
- Propose architecture changes based on this --> resulting model performs and generalizes better!
- DeConvNet used for this: to inspect one feature activation, set all other activations in that layer to 0, then map it back to input space by successively unpooling, rectifying, and filtering at each layer
- Unpooling: max pooling is not invertible, so it is approximated using switch variables that record the location of the maximum within each pooling region --> visualizations are image-specific!
- Rectification: pass through ReLU
- Filtering: convolving reconstructed signal with transpose of convolutional layer filters
- Lower layers converge within a few epochs, while upper layers need more epochs to develop
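- Small transformations of the input image have a large effect on the lower layers, but a smaller effect on the top feature layers
- Model output is fairly stable to translation + scaling, but not invariant to rotation (except for rotationally symmetric objects)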
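- Activity in the feature map (and the correct-class probability) drops when the object is occluded --> the model localizes the object rather than relying on surrounding context
- CNNs implicitly learn correspondence between object parts: occluding the same part (e.g. the eyes) across images of the object in different poses changes the features more consistently (lower score) than occluding random parts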
- A minimum overall depth of the model, rather than any individual section (group of layers), is vital to performance!
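
The unpooling / rectification / filtering steps of the DeConvNet can be sketched for a single conv + ReLU + max-pool layer in plain NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses one single-channel filter, non-overlapping pooling, and no padding, and all function names (`max_pool_with_switches`, `unpool`, `correlate_transpose`, `deconv_visualize`) are hypothetical.

```python
import numpy as np

def correlate_valid(x, f):
    """Forward 'convolution' as used in CNNs: valid cross-correlation."""
    kh, kw = f.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * f)
    return out

def correlate_transpose(y, f):
    """Transpose (adjoint) of correlate_valid: scatter each value of y through f."""
    kh, kw = f.shape
    out = np.zeros((y.shape[0] + kh - 1, y.shape[1] + kw - 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            out[i:i + kh, j:j + kw] += y[i, j] * f
    return out

def max_pool_with_switches(x, k=2):
    """Non-overlapping k x k max pooling; also returns the argmax per region."""
    h, w = x.shape
    assert h % k == 0 and w % k == 0, "assumes dimensions divisible by k"
    # Rearrange into (h//k, w//k) regions, each flattened to k*k values.
    blocks = (x.reshape(h // k, k, w // k, k)
               .transpose(0, 2, 1, 3)
               .reshape(h // k, w // k, k * k))
    switches = blocks.argmax(axis=-1)  # position of the max inside each region
    return blocks.max(axis=-1), switches

def unpool(pooled, switches, k=2):
    """Approximate inverse of max pooling: place each value at its switch location."""
    ph, pw = pooled.shape
    blocks = np.zeros((ph, pw, k * k), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    blocks[rows, cols, switches] = pooled
    return (blocks.reshape(ph, pw, k, k)
                  .transpose(0, 2, 1, 3)
                  .reshape(ph * k, pw * k))

def deconv_visualize(image, f, k=2):
    """Project the strongest pooled activation back to input space."""
    # Forward pass: filter -> ReLU -> max pool (recording switches).
    feat = np.maximum(correlate_valid(image, f), 0)
    pooled, switches = max_pool_with_switches(feat, k)
    # Keep only the strongest activation, set all others to 0.
    only = np.zeros_like(pooled)
    idx = np.unravel_index(pooled.argmax(), pooled.shape)
    only[idx] = pooled[idx]
    # Backward pass: unpool -> rectify -> filter with the transposed filter.
    unpooled = unpool(only, switches, k)
    rectified = np.maximum(unpooled, 0)
    return correlate_transpose(rectified, f)
```

Calling `deconv_visualize(image, f)` on a 5x5 `image` with a 2x2 filter `f` returns a 5x5 array that is non-zero only in the input patch responsible for the selected activation. Because the switches come from that particular forward pass, the result depends on the specific image.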
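
The occlusion experiment can be sketched in the same spirit: slide an occluding square over the image and record a model score at each position. This is a rough sketch, not the paper's exact procedure. `occlusion_map` and its parameters (`patch`, `stride`, `fill`) are hypothetical names, and `score_fn` is an assumed stand-in for whatever is being monitored (e.g. the correct-class probability or a feature-map activation).

```python
import numpy as np

def occlusion_map(image, score_fn, patch=2, stride=1, fill=0.5):
    """Score the image with a square occluder placed at each position.

    image    : 2D array (single channel, for simplicity)
    score_fn : callable mapping an image to a scalar score (assumed)
    fill     : value written into the occluded square (assumed constant grey)
    """
    h, w = image.shape
    rows = range(0, h - patch + 1, stride)
    cols = range(0, w - patch + 1, stride)
    heat = np.zeros((len(rows), len(cols)))
    for a, i in enumerate(rows):
        for b, j in enumerate(cols):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = fill
            heat[a, b] = score_fn(occluded)
    return heat

# Toy stand-in for a classifier: the "object" is a bright 2x2 square,
# and the score is the mean brightness of that region.
toy_image = np.zeros((6, 6))
toy_image[2:4, 2:4] = 1.0

def toy_score(img):
    return img[2:4, 2:4].mean()

heat = occlusion_map(toy_image, toy_score, patch=2, stride=1, fill=0.0)
```

Low values in `heat` mark occluder positions that hurt the score most. With a real network, a sharp dip over the object is the evidence that the model attends to the object itself rather than the background.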