ICDAR 2023 Competition and CLEF 2023 Lab on Document Information Localization and Extraction

DocILE is a large-scale research benchmark for cross-evaluation of machine learning methods for Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) from semi-structured business documents such as invoices, orders etc. Such large-scale benchmark was previously missing (Skalický et al., 2022), hindering comparative evaluation.

Quick Links:
Join DocILE announcements google group to receive updates and news about the challenge
Form to get access to the dataset
CLEF labs registration form
DocILE Rules and Prize Eligibility
Github repository
Dataset paper (Šimsa et al., 2023) with Supplementary Material
Make submissions via Robust Reading Competition website

Competition and Prizes

The DocILE'23 competition is running as a CLEF 2023 lab and an ICDAR 2023 competition with a single leaderboard. The deadline for submissions is on May 10 May 24, 2023. The competition comes with a prize pool of $9000:

  • Top three eligible teams on the KILE leaderboard will receive $2000, $1000 and $500 respectively.
  • Top three eligible teams on the LIR leaderboard will receive $2000, $1000 and $500 respectively.
  • A $2000 best-paper award, selected by the lab organizers and the steering committee.
In order to participate in the competition (and be eligible for prizes), you need to follow the competition rules. Most notably:
  • You need to register using the CLEF Labs Registration Form.
  • It is prohibited to use external document datasets and models trained on these datasets.
The competition/benchmark submissions will be handled via the DocILE page at the Robust Reading Competition website. The benchmark will stay live after the competition ends.

Data

DocILE is the largest dataset of business documents with KILE and LIR labels for 106680 documents (6680 annotated and 100k synthetic) and almost 1M unlabeled documents for unsupervised pre-training.

Fill out the Dataset Access Request form to get access to the data for research purposes.

We provide rossumai/docile repository to easily load the data, visualize the annotations and to run the evaluation.

[Legal information on the processing of personal data for the purpose of scientific research.]

Dataset and Benchmark Paper

The dataset, the benchmark tasks and the evaluation criteria are described in detail in the dataset paper (Šimsa et al., 2023) which was accepted to ICDAR 2023. The provided link is to arXiv version that includes Supplementary Material. To cite the dataset, please use the following BibTeX entry:

@misc{simsa2023docile,
    title={{DocILE} Benchmark for Document Information Localization and Extraction},
    author={{\v{S}}imsa, {\v{S}}t{\v{e}}p{\'a}n and {\v{S}}ulc, Milan and U{\v{r}}i{\v{c}}{\'a}{\v{r}}, Michal and Patel, Yash and Hamdi, Ahmed and Koci{\'a}n, Mat{\v{e}}j and Skalick{\`y}, Maty{\'a}{\v{s}} and Matas, Ji{\v{r}}{\'\i} and Doucet, Antoine and Coustaty, Micka{\"e}l and Karatzas, Dimosthenis},
    url = {https://arxiv.org/abs/2302.05658},
    journal={arXiv preprint arXiv:2302.05658},
    year={2023}
}

Organizers

Contact the organizers at: docile-2023-organizers@googlegroups.com

Task Chairs:
Milan Šulc, Head of Rossum AI Labs, Rossum.ai, Czech Republic.
Štěpán Šimsa, Researcher at Rossum AI Labs, Rossum.ai, Czech Republic.

Co-Organizers:
Ahmed Hamdi, Associate Professor at the University of La Rochelle, France.
Yash Patel, PhD. candidate at the Visual Recognition Group, Czech Technical University in Prague, Czech Republic.
Matyáš Skalický, Research Engineer at Rossum.ai, Czech Republic.

Steering Committee:
Michal Uřičář, Researcher at Rossum AI Labs, Rossum.ai, Czech Republic.
Antoine Doucet, Full Professor of Computer Science at the University of La Rochelle, France.
Mickael Coustaty, Associate Professor at the University of La Rochelle, France.
Dimosthenis Karatzas, Associate Director of the Computer Vision Center, Barcelona, Spain.