SynthNID: Synthetic Data to Improve End-to-end Bangla Document Key Information Extraction

Published in the BLP Workshop at EMNLP 2023, Singapore

Authors: Syed Monsur*, Shariar Kabir*, Sakib Chowdhury*   (* co-first author)

End-to-end Document Key Information Extraction models require substantial compute and labeled data to perform well on real datasets. This is particularly challenging for low-resource languages like Bangla, where domain-specific multimodal document datasets are scarce. In this paper, we introduce SynthNID, a system that generates domain-specific document image data for training OCR-less end-to-end Key Information Extraction systems. We show that the generated data improves the performance of the extraction model on real datasets, and that the system can easily be extended to generate other types of scanned documents for a wide range of document understanding tasks.

[PDF]   [ACL Anthology]

Recommended citation: Syed Monsur, Shariar Kabir, Sakib Chowdhury. (2023). "SynthNID: Synthetic Data to Improve End-to-end Bangla Document Key Information Extraction." BLP Workshop at EMNLP 2023, pages 117–123, Singapore.