paper summary: “DocFormer: End-to-End Transformer for Document Understanding”

arxiv: https://arxiv.org/abs/2106.11539 this work proposes a backbone for visual document understanding domain. It uses text, visual, spatial features. Key points use text, visual, spatial features at each encoding layer, keep feeding in visual and spatial features on the input side. This has the ‘residual’ connection effect. text and visual features Read more…

paper summary: “BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents”

arxiv: https://arxiv.org/abs/2108.04539 key points use text and spatial information. doesn’t utilize image feature a better spatial information encoding method compared to LayoutLM propose new pretraining task: Area Masked Language Model spatial information encoding method For each text box, get four corner point x,y coordinates and normalize them all with image Read more…