Visualizing and understanding convolutional networks

Matthew D. Zeiler, Rob Fergus (2013)

Key points

Novel visualization technique to understand representations learned by intermediate layers of a CNN
- Propose architecture changes based on what the visualizations reveal (smaller first-layer filters and stride) --> resulting model performs better on ImageNet and generalizes to other datasets!
DeConvNet used for this: to inspect one feature activation, set all other activations in that layer to 0, then map it back to input pixel space by repeatedly unpooling, rectifying and filtering, layer by layer
- Unpooling: max pooling is not invertible, so it is approximated using switch variables that record the location of the maximum within each pooling region --> visualizations are image-specific!
- Rectification: pass through ReLU
- Filtering: convolve the reconstructed signal with the transpose of the same convolutional layer's filters (each filter flipped vertically and horizontally)
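The three DeConvNet steps above can be sketched in NumPy for a single 2-D feature map. This is a minimal illustration, not the authors' implementation: it assumes non-overlapping pooling windows, one channel, and a stride-1 "valid" forward filter, and all function names are made up for this example.

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """Non-overlapping k x k max pooling on a 2-D map.

    Returns the pooled map and the switches: for each pooling window,
    the flat index (0 .. k*k-1) of its maximum.
    """
    h, w = x.shape
    assert h % k == 0 and w % k == 0, "sketch assumes size divisible by k"
    # Rearrange so the last axis holds the k*k values of one window.
    windows = x.reshape(h // k, k, w // k, k).transpose(0, 2, 1, 3)
    windows = windows.reshape(h // k, w // k, k * k)
    return windows.max(axis=-1), windows.argmax(axis=-1)

def unpool_with_switches(pooled, switches, k=2):
    """Put each pooled value back at its recorded location, zeros elsewhere."""
    ph, pw = pooled.shape
    windows = np.zeros((ph, pw, k * k), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    windows[rows, cols, switches] = pooled
    # Undo the window rearrangement used in the pooling step.
    out = windows.reshape(ph, pw, k, k).transpose(0, 2, 1, 3)
    return out.reshape(ph * k, pw * k)

def rectify(x):
    """ReLU, applied to the reconstructed signal."""
    return np.maximum(x, 0)

def correlate_valid(x, f):
    """Forward-layer filtering: stride-1 'valid' cross-correlation."""
    kh, kw = f.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * f)
    return out

def transposed_filter(y, f):
    """DeConvNet filtering: the transpose of correlate_valid(., f).

    Implemented as zero-padding y and correlating with f flipped
    vertically and horizontally.
    """
    kh, kw = f.shape
    padded = np.pad(y, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    return correlate_valid(padded, f[::-1, ::-1])

# Toy example: pool a 4x4 map, then unpool it with the recorded switches.
x = np.array([[1., 2., 0., 0.],
              [3., 4., 0., 5.],
              [0., 0., 7., 0.],
              [6., 0., 0., 0.]])
pooled, switches = max_pool_with_switches(x)
restored = unpool_with_switches(pooled, switches)
```

One pass down through a layer would then be `transposed_filter(rectify(unpool_with_switches(pooled, switches)), f)` for that layer's filter `f`. Because the switches come from the forward pass on a particular image, the reconstruction depends on that image.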
Lower layers converge within a few epochs, while upper layers need more epochs to develop
Small transformations in the image have a larger effect on lower layers than on upper layers
- Model is fairly stable to translation + scaling, not rotation
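One way to probe this kind of invariance is to compare the feature vector of an image with that of a transformed copy. The sketch below is only a toy: `low_level` and `high_level` are hypothetical stand-ins for a lower and an upper layer (not the paper's network), and the translation is a circular shift via `np.roll` rather than a true image translation.

```python
import numpy as np

def feature_distance(image, feature_fn, shifts):
    """Euclidean distance between the features of the original image
    and of each horizontally (circularly) shifted copy."""
    base = feature_fn(image)
    return [
        float(np.linalg.norm(feature_fn(np.roll(image, s, axis=1)) - base))
        for s in shifts
    ]

def low_level(im):
    """Hypothetical lower-layer features: the raw pixels."""
    return im.ravel()

def high_level(im):
    """Hypothetical upper-layer features: per-row maximum, which does
    not change under horizontal circular shifts."""
    return im.max(axis=1)

# A single bright pixel on a dark background.
image = np.zeros((4, 8))
image[1, 2] = 1.0

shifts = [0, 1, 2]
low = feature_distance(image, low_level, shifts)
high = feature_distance(image, high_level, shifts)
```

In this toy setup the pixel-level features move as soon as the image shifts, while the pooled features stay put. With a real network, `feature_fn` would return the activations of the chosen layer.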
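Feature-map activity (and the correct-class score) drops when the object is occluded, so the model is localizing the object rather than relying on surrounding context:
- CNNs implicitly learn correspondence between object parts: occluding the same part (e.g. an eye or the nose) across different images changes the features more consistently (lower score) than occluding random parts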
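The occlusion experiment can be sketched as sliding a patch over the image and recording a score at each position. This is a minimal sketch under assumptions: `score_fn` is a hypothetical callable standing in for whatever is monitored (a class probability or a feature-map activation), and the patch size, stride and fill value are illustrative.

```python
import numpy as np

def occlusion_map(image, score_fn, patch=2, stride=2, fill=0.5):
    """Slide a constant-valued patch over a 2-D image and record
    score_fn(occluded_image) for each patch position."""
    h, w = image.shape[:2]
    rows = range(0, h - patch + 1, stride)
    cols = range(0, w - patch + 1, stride)
    heat = np.zeros((len(rows), len(cols)))
    for a, i in enumerate(rows):
        for b, j in enumerate(cols):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = fill
            heat[a, b] = score_fn(occluded)
    return heat

# Toy image: a bright 2x2 "object" in the centre of a dark 6x6 image.
image = np.zeros((6, 6))
image[2:4, 2:4] = 1.0

def toy_score(im):
    """Hypothetical score: mean brightness over the object region."""
    return float(im[2:4, 2:4].mean())

heat = occlusion_map(image, toy_score, patch=2, stride=2, fill=0.0)
lowest = np.unravel_index(heat.argmin(), heat.shape)
```

Positions where the score drops most are the regions the score depends on. In the toy case that is the centre cell, where the patch covers the object.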
A minimum overall depth of the model, rather than any individual section, is vital to performance!