paper link

submitted in 2016

I’m only interested in the stacked hourglass architecture, not about pose segmentation performance. So the points listed below are only related to stacked hourglass architecture.

  • “encoding-decoding” or “conv-deconv” structure is already introduced. This paper goes one step further and stacks muiltiple “hourglass” structure. While stacking, each hourglass output will be supervised.
  • compared to FCN, a single hourglass structure used in this paper has a more symmetrical structure.
  • conv layers and maxpooling used in downsampling. nearest neighbor upsampling and elementwise addition is used during upsampling.
  • Each hourglass output will go through 1×1 conv layer to be used as intermediate output. this will be added back to the output feature map and then passed on to the next hourglass module.
  • The paper does some tests, with single hourglass module and stacked hourglass module.
  • use MSE loss
  • two ablation studies: 1) effect of intermediate supervision 2) effect of number of stacked hourglasses
  • for study on intermediate supervision, baseline is 8-stacked hourglass. when then number of stacks are decreased, each hourglass capacity will be increased for compensation. Results show that more stacks do provide better results.
  • for study on intermediate supervision, although intermediate supervison applied to single HG gives good results, doing both is better


Leave a Reply

Your email address will not be published.