Obviously, the strategy in Ciresan et al. [1] has two drawbacks.
First, it is quite slow because the network must be run separately for each patch, and there is a lot of redundancy due to overlapping patches.
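To get a rough feel for this cost, here is a minimal sketch that counts how many forward passes a patch-per-pixel scheme needs and how often each input pixel is re-read. The 512x512 image, the 65x65 patch and the stride of 1 are illustrative assumptions, not values taken from the text above, and `sliding_window_cost` is a hypothetical helper name.

```python
def sliding_window_cost(height, width, patch, stride=1):
    """Count patches and input pixels read when classifying one patch at a time.

    Assumes no padding, so only patches that fit fully inside the image count.
    """
    n_rows = (height - patch) // stride + 1
    n_cols = (width - patch) // stride + 1
    n_patches = n_rows * n_cols              # one network run per patch
    pixels_read = n_patches * patch * patch  # total input pixels fed to the network
    redundancy = pixels_read / (height * width)  # average re-reads per image pixel
    return n_patches, pixels_read, redundancy


# Illustrative numbers only (assumed, not from the paper).
n_patches, pixels_read, redundancy = sliding_window_cost(512, 512, patch=65)
print(n_patches, pixels_read, round(redundancy, 1))
```

Under these assumptions the network runs about two hundred thousand times for a single image, and neighbouring patches share almost all of their pixels, which is the redundancy the sentence refers to.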
Secondly, there is a trade-off between localization accuracy and the use of context.
Larger patches require more max-pooling layers that reduce the localization accuracy, while small patches allow the network to see only little context.
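The trade-off can be made concrete with the usual receptive-field recurrence. The sketch below assumes a toy network made of repeated blocks of one 3x3 convolution (stride 1) followed by one 2x2 max pooling (stride 2); this layout and the helper name `receptive_field` are assumptions for illustration, not the architecture of any cited paper.

```python
def receptive_field(n_blocks, conv_kernel=3, pool_kernel=2, pool_stride=2):
    """Return (receptive field, output stride) after n_blocks of conv + max-pool.

    Recurrence per layer: rf += (kernel - 1) * jump; jump *= stride,
    where `jump` is the distance in input pixels between adjacent outputs.
    """
    rf, jump = 1, 1
    for _ in range(n_blocks):
        rf += (conv_kernel - 1) * jump   # stride-1 convolution widens the view
        rf += (pool_kernel - 1) * jump   # pooling widens the view a little more
        jump *= pool_stride              # ...and coarsens the output grid
    return rf, jump


for blocks in range(1, 5):
    rf, stride = receptive_field(blocks)
    print(f"{blocks} blocks: sees {rf}x{rf} pixels, one output per {stride} pixels")
```

Each extra pooling layer lets the network see more context, but the output grid also becomes coarser by the pooling stride, so the position of a prediction is known less precisely.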
More recent approaches [11,4] proposed a classifier output that takes into account the features from multiple layers.
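A more recent kind of classifier that takes the outputs of multiple layers into account. My guess before reading the references: something like DenseNet or ResNet.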
Good localization and the use of context are possible at the same time.
[4] Hypercolumns for Object Segmentation and Fine-grained Localization
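This paper raises a problem: in a typical CNN the last layer carries only semantic information. Upsampling the feature maps after each convolution layer (somewhat like a deconvolution), however, makes it possible to extract features at different scales.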
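A minimal sketch of the idea, in plain Python: feature maps from several depths are resized to the input resolution and stacked, so every pixel gets one vector with an entry per layer. Nearest-neighbour resizing is swapped in here for simplicity and is an assumption, as are the single-channel toy maps and the helper names `upsample_nearest` and `hypercolumns`; the sketch does not reproduce the paper's implementation.

```python
def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a 2-D list `fmap` to out_h x out_w."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def hypercolumns(feature_maps, out_h, out_w):
    """Stack resized maps: result[y][x] is a vector with one value per map."""
    resized = [upsample_nearest(f, out_h, out_w) for f in feature_maps]
    return [
        [[r[y][x] for r in resized] for x in range(out_w)]
        for y in range(out_h)
    ]


# Toy single-channel maps standing in for shallow, middle and deep layers.
fine = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
mid = [[10, 20], [30, 40]]
coarse = [[100]]

hc = hypercolumns([fine, mid, coarse], 4, 4)
print(hc[1][2])  # fine detail, mid-level and coarse value for one pixel
```

The shallow map contributes precise location, the deep map contributes context, and a per-pixel classifier on top of these vectors can use both at once.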
[11] Image segmentation with cascaded hierarchical models and logistic disjunctive normal networks
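This rather cryptic figure boils down to extracting features at different scales and upsampling them. Every stage produces a classification result, that result is in turn used as a feature for further classification, and max pooling sits between the stages.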