paper review: “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension”

arxiv: https://arxiv.org/abs/1910.13461 Key points: proposes an autoregressive model named BART, which is architecturally similar to a standard Transformer encoder + decoder; checks out five pretraining tasks and experiments to find which pretraining task is most helpful; tests BART's performance on downstream tasks after large-scale pretraining. Model Architecture: this work introduces BART, which is fundamentally Read more…
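One of the five pretraining (noising) tasks the paper compares is text infilling, where each sampled span is replaced by a single mask token, with span lengths drawn from a Poisson(λ = 3) distribution. A minimal sketch of that noising step, assuming word-level tokens and an illustrative `mask_prob` (the real implementation works on subword tokens and different rates):

```python
import numpy as np

def text_infilling(tokens, mask_token="<mask>", mask_prob=0.15, lam=3.0, seed=0):
    """Toy BART-style text infilling: each sampled span (length ~ Poisson(lam),
    lam=3 per the paper) is replaced by a SINGLE mask token; a 0-length span
    just inserts a mask. mask_prob is an illustrative rate, not the paper's."""
    rng = np.random.default_rng(seed)
    out, i = [], 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            span = int(rng.poisson(lam))
            out.append(mask_token)       # the whole span collapses to one token
            if span == 0:
                out.append(tokens[i])    # 0-length span: insert mask, keep token
                i += 1
            else:
                i += span                # skip the masked-out span
        else:
            out.append(tokens[i])
            i += 1
    return out

print(text_infilling("the quick brown fox jumps over the lazy dog".split()))
```

Because several original tokens can collapse into one mask, the model must also predict how many tokens a span contained, which the paper argues is what makes infilling a stronger task than plain token masking.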

paper summary: “DocFormer: End-to-End Transformer for Document Understanding”

arxiv: https://arxiv.org/abs/2106.11539 This work proposes a backbone for the visual document understanding domain that uses text, visual, and spatial features. Key points: text, visual, and spatial features are used at each encoding layer, with visual and spatial features fed back in on the input side of every layer, which has a 'residual'-connection effect; text and visual features Read more…
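A loose sketch of that per-layer re-injection idea, with a plain `nn.TransformerEncoderLayer` standing in for DocFormer's multi-modal attention and all dimensions chosen for illustration:

```python
import torch
import torch.nn as nn

class ReinjectEncoder(nn.Module):
    """Sketch of DocFormer-style feature re-injection: visual and spatial
    embeddings are added back at every layer rather than only at the input,
    acting like a residual path for those modalities."""
    def __init__(self, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, text_emb, visual_emb, spatial_emb):
        h = text_emb
        for layer in self.layers:
            # re-add visual + spatial features on the input side of each layer
            h = layer(h + visual_emb + spatial_emb)
        return h

enc = ReinjectEncoder()
t = torch.randn(2, 128, 256)   # (batch, seq, dim) text features
v = torch.randn(2, 128, 256)   # visual features aligned to tokens
s = torch.randn(2, 128, 256)   # spatial (box) embeddings
print(enc(t, v, s).shape)      # torch.Size([2, 128, 256])
```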

paper summary: “BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents”

arxiv: https://arxiv.org/abs/2108.04539 Key points: uses text and spatial information but does not utilize image features; offers a better spatial-information encoding method compared to LayoutLM; proposes a new pretraining task, Area-Masked Language Model. Spatial information encoding method: for each text box, get the four corner points' x, y coordinates and normalize them all with the image Read more…
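A small sketch of that corner-point normalization step; the flat 8-dim output layout is an assumption, and the embedding BROS applies on top of the normalized coordinates is omitted here:

```python
import numpy as np

def encode_box(corners, img_w, img_h):
    """Take the four (x, y) corner coordinates of a text box and normalize
    them by the image size, giving an 8-dim vector in [0, 1]."""
    pts = np.asarray(corners, dtype=np.float32)   # shape (4, 2)
    pts[:, 0] /= img_w                            # normalize x by image width
    pts[:, 1] /= img_h                            # normalize y by image height
    return pts.reshape(-1)                        # flat (8,) feature vector

# a box on a 1000x800 page: top-left, top-right, bottom-right, bottom-left
vec = encode_box([(100, 50), (300, 50), (300, 90), (100, 90)], 1000, 800)
print(vec)  # [0.1 0.0625 0.3 0.0625 0.3 0.1125 0.1 0.1125]
```

Using all four corners (instead of only the LayoutLM-style top-left/bottom-right pair) lets the encoding represent rotated or skewed boxes as well.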

paper summary: “Perceiver IO: A General Architecture for Structured Inputs & Outputs”

arxiv: https://arxiv.org/abs/2107.14795 Key points: developing the Perceiver idea further, Perceiver IO proposes a Perceiver-like structure where the output size can be much larger while overall complexity stays linear. (Check out the summary on Perceiver here.) As with the Perceiver, this work uses a latent array to store input information and runs it through multiple self Read more…
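A compact sketch of the read-process-write pattern, assuming standard `nn.MultiheadAttention` blocks in place of the paper's full attention modules. The latent size N stays fixed, so attention over the inputs (length M) and the output queries (length O) is linear in M and O; only the cheap latent self-attention is quadratic, and only in N:

```python
import torch
import torch.nn as nn

class PerceiverIOSketch(nn.Module):
    """Minimal Perceiver IO pattern: cross-attend inputs into a small latent
    array, self-attend in latent space, then decode by letting learned output
    queries cross-attend the latents. Dimensions here are illustrative."""
    def __init__(self, dim=128, n_latents=64, n_outputs=256, n_heads=4, depth=2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.out_queries = nn.Parameter(torch.randn(n_outputs, dim))
        self.read = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.process = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(depth)
        )
        self.write = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x):                       # x: (batch, M, dim), any M
        b = x.size(0)
        z = self.latents.expand(b, -1, -1)      # (batch, N, dim)
        z, _ = self.read(z, x, x)               # read: latents attend to inputs
        for attn in self.process:
            z2, _ = attn(z, z, z)               # process: latent self-attention
            z = z + z2
        q = self.out_queries.expand(b, -1, -1)  # (batch, O, dim), any O
        out, _ = self.write(q, z, z)            # write: queries attend latents
        return out

model = PerceiverIOSketch()
print(model(torch.randn(2, 1000, 128)).shape)   # torch.Size([2, 256, 128])
```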

paper summary: “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”

arxiv: https://arxiv.org/abs/2103.14030 Key points: multi-scale feature extraction, which can be thought of as an adoption of the FPN idea; restricts transformer operations to within each window rather than the entire feature map → keeps overall complexity linear instead of quadratic; applies shifted windows to allow inter-window interaction; fuses relative position information in Read more…
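A small sketch of the window partitioning and the half-window cyclic shift; the (B, H, W, C) tensor layout and sizes are illustrative:

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows so
    attention runs per window. With fixed window size w, attention costs
    O(w^4 * C) per window and there are (H*W)/w^2 windows, so total cost
    is linear in H*W instead of quadratic."""
    B, H, W, C = x.shape
    w = window_size
    x = x.view(B, H // w, w, W // w, w, C)
    # -> (num_windows * B, w*w, C): one row of tokens per window
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)

def shift_windows(x, window_size):
    """The shifted-window step: cyclically roll the feature map by half a
    window before partitioning, so the next attention block mixes tokens
    across the previous block's window boundaries."""
    s = window_size // 2
    return torch.roll(x, shifts=(-s, -s), dims=(1, 2))

feat = torch.randn(2, 56, 56, 96)           # (B, H, W, C)
print(window_partition(feat, 7).shape)      # torch.Size([128, 49, 96])
shifted = shift_windows(feat, 7)
print(window_partition(shifted, 7).shape)   # torch.Size([128, 49, 96])
```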

paper summary: “VarifocalNet: An IoU-aware Dense Object Detector” (VFNet)

arxiv: https://arxiv.org/abs/2008.13367 Key points: another anchor-free, point-based object detection network; introduces a new loss, varifocal loss, which is forked from focal loss and makes some changes to compensate for the positive/negative imbalance further; instead of predicting the classification and IoU scores separately, this work predicts a single scalar which Read more…
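A standalone sketch of the varifocal loss as defined in the paper (not the reference implementation): the target q is the IoU-aware classification score, i.e. the IoU with the ground-truth box for positives and 0 for negatives. Positives are weighted by q itself rather than down-weighted, while negatives get the focal-style α·p^γ down-weighting:

```python
import torch
import torch.nn.functional as F

def varifocal_loss(logits, target_q, alpha=0.75, gamma=2.0):
    """Varifocal loss sketch.
    Positives (q > 0): -q * (q*log(p) + (1-q)*log(1-p)), i.e. BCE weighted by q.
    Negatives (q = 0): -alpha * p**gamma * log(1-p), focal-style down-weighting.
    alpha=0.75 and gamma=2.0 follow the paper's defaults."""
    p = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(logits, target_q, reduction="none")
    pos = (target_q > 0).float()
    weight = pos * target_q + (1.0 - pos) * alpha * p.pow(gamma)
    return (weight * bce).sum()

logits = torch.tensor([2.0, -1.0, 0.5])
q = torch.tensor([0.9, 0.0, 0.0])   # one positive with IoU 0.9, two negatives
print(varifocal_loss(logits, q))
```

Training the classification branch against these continuous q targets is what merges the separate classification and IoU predictions into the single scalar the excerpt mentions.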