ICDAR 2023 Competition and CLEF 2023 Lab on Document Information Localization and Extraction
DocILE is a large-scale research benchmark for cross-evaluation of machine learning methods for Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) from semi-structured business documents such as invoices and orders. Such a large-scale benchmark was previously missing (Skalický et al., 2022), hindering comparative evaluation.
Quick Links:
Join the DocILE announcements Google group to receive updates and news about the challenge
Form to get access to the dataset
CLEF labs registration form
DocILE Rules and Prize Eligibility
GitHub repository
Dataset paper (Šimsa et al., 2023) with Supplementary Material
Make submissions via the Robust Reading Competition website
Competition and Prizes
The DocILE'23 competition is running as a CLEF 2023 lab and an ICDAR 2023 competition with a single leaderboard. The deadline for submissions is May 24, 2023 (extended from the original May 10).
The competition comes with a prize pool of $9000:
- The top three eligible teams on the KILE leaderboard will receive $2000, $1000, and $500, respectively.
- The top three eligible teams on the LIR leaderboard will receive $2000, $1000, and $500, respectively.
- A $2000 best-paper award, selected by the lab organizers and the steering committee.
To be eligible for a prize, teams must register using the CLEF Labs Registration Form. Using external document datasets, or models trained on such datasets, is prohibited; see the DocILE Rules and Prize Eligibility for details.
Data
DocILE is the largest dataset of business documents with KILE and LIR labels: 106,680 labeled documents (6,680 annotated real documents and 100,000 synthetic ones), plus almost 1M unlabeled documents for unsupervised pre-training.
Fill out the Dataset Access Request form to get access to the data for research purposes.
We provide the rossumai/docile repository to easily load the data, visualize the annotations, and run the evaluation.
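For orientation, here is a minimal sketch of loading the annotated training split with the docile package (installable via pip install docile-benchmark). The names used below (Dataset, annotation.fields, annotation.li_fields, page_count, page_image) follow the repository README at the time of writing and should be treated as assumptions rather than a stable contract; DATASET_PATH is a hypothetical local path.

# A minimal sketch, assuming the docile package is installed and the
# dataset was downloaded to DATASET_PATH (hypothetical path below).
from docile.dataset import Dataset

DATASET_PATH = "data/docile"

dataset = Dataset("train", DATASET_PATH)  # annotated training split

for document in dataset:
    kile_fields = document.annotation.fields    # KILE: key information fields
    li_fields = document.annotation.li_fields   # LIR: fields grouped into line items
    for page in range(document.page_count):
        fields_on_page = [f for f in kile_fields if f.page == page]
        image = document.page_image(page)       # rendered page image

The same repository implements the KILE and LIR evaluation used for the shared leaderboard, so results can be checked locally before submission.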
Legal information on the processing of personal data for the purpose of scientific research.
Dataset and Benchmark Paper
The dataset, the benchmark tasks, and the evaluation criteria are described in detail in the dataset paper (Šimsa et al., 2023), which was accepted to ICDAR 2023. The link points to the arXiv version, which includes the Supplementary Material. To cite the dataset, please use the following BibTeX entry:
@misc{simsa2023docile,
  title = {{DocILE} Benchmark for Document Information Localization and Extraction},
  author = {{\v{S}}imsa, {\v{S}}t{\v{e}}p{\'a}n and {\v{S}}ulc, Milan and U{\v{r}}i{\v{c}}{\'a}{\v{r}}, Michal and Patel, Yash and Hamdi, Ahmed and Koci{\'a}n, Mat{\v{e}}j and Skalick{\`y}, Maty{\'a}{\v{s}} and Matas, Ji{\v{r}}{\'\i} and Doucet, Antoine and Coustaty, Micka{\"e}l and Karatzas, Dimosthenis},
  url = {https://arxiv.org/abs/2302.05658},
  journal = {arXiv preprint arXiv:2302.05658},
  year = {2023}
}
Organizers
Contact the organizers at: docile-2023-organizers@googlegroups.com
Task Chairs:
Milan Šulc, Head of Rossum AI Labs, Rossum.ai, Czech Republic.
Štěpán Šimsa, Researcher at Rossum AI Labs, Rossum.ai, Czech Republic.
Co-Organizers:
Ahmed Hamdi, Associate Professor at the University of La Rochelle, France.
Yash Patel, PhD candidate at the Visual Recognition Group, Czech Technical University in Prague, Czech Republic.
Matyáš Skalický, Research Engineer at Rossum.ai, Czech Republic.
Steering Committee:
Michal Uřičář, Researcher at Rossum AI Labs, Rossum.ai, Czech Republic.
Antoine Doucet, Full Professor of Computer Science at the University of La Rochelle, France.
Mickaël Coustaty, Associate Professor at the University of La Rochelle, France.
Dimosthenis Karatzas, Associate Director of the Computer Vision Center, Barcelona, Spain.