paper summary: “DocFormer: End-to-End Transformer for Document Understanding”

arxiv: https://arxiv.org/abs/2106.11539 this work proposes a backbone for visual document understanding domain. It uses text, visual, spatial features. Key points use text, visual, spatial features at each encoding layer, keep feeding in visual and spatial features on the input side. This has the ‘residual’ connection effect. text and visual features Read more…