Task 3: Generating Full Fact-Checking Articles
Definition
This task introduces a new, final step into the CLEF CheckThat! lab pipeline, which aims to automate the writing of fact-checking articles. Given a claim, its veracity label, and the set of evidence documents consulted while fact-checking the claim, participants must generate a full fact-checking article.
Datasets
We reuse the WatClaimCheck dataset for the training and validation sets. To download and use a copy of the WatClaimCheck dataset, follow the instructions available here. The dataset also provides scraped content from the premise articles / evidence webpages, where available.
The test set as well as the test set with ground truth articles are available on Google Drive here.
Expected Output
For each datapoint, we expect output in the following format:
{
  "id": 42,
  "factchecking_article": "<corresponding generated article for \"OpIndia claimed Greta Thunberg's real name is Ghazala bhat -- by Faisal Al Qasimi, Carolina Monteiro\">"
}
Furthermore, the evaluation script expects inline citations wherever necessary in the following format:
"This is a sample sentence making a claim grounded in some evidence (source: https://www.example1.com). This is another sample sentence making a claim grounded in multiple evidence (source: https://www.example2.com; https://www.example3.com)."
If citations are not present (or are not present in this format), the evaluation script will penalise your submissions. Please reach out in case your submissions have any issues. We will be available for any rectifications to ensure fair evaluation for everyone.
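As a minimal sketch of this citation format, the snippet below parses the "(source: URL)" and "(source: URL; URL)" spans from generated text. The regex and helper name are our own illustration, not the official evaluation script's implementation:

```python
import re

# Illustrative pattern for "(source: URL)" spans with one or more URLs
# separated by semicolons. The official scorer's exact pattern is not
# published; this is only a sketch of the format described above.
CITATION_RE = re.compile(
    r"\(source:\s*(https?://[^\s;)]+(?:;\s*https?://[^\s;)]+)*)\)"
)

def extract_citations(text: str) -> list[list[str]]:
    """Return the list of URL groups cited in each (source: ...) span."""
    return [
        [url.strip() for url in group.split(";")]
        for group in CITATION_RE.findall(text)
    ]

sample = (
    "This is a sample sentence making a claim grounded in some evidence "
    "(source: https://www.example1.com). This is another sample sentence "
    "making a claim grounded in multiple evidence "
    "(source: https://www.example2.com; https://www.example3.com)."
)
```

Running `extract_citations(sample)` yields one URL group per citation span, which is a quick way to confirm your articles will be parseable before submitting.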
Evaluation
The evaluation scripts are available here.
We will use the mean of the following metrics to assess the generated text:
(i) entailment score, a reference-based metric that measures whether the generated text is entailed by the reference;
(ii) citation recall and (iii) citation precision, as proposed by Gao et al. (2023);
and (iv) evidence coverage, which computes the proportion of input evidence that is correctly cited in the generated text.
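The evidence-coverage idea in (iv) can be sketched as follows: the fraction of the input evidence URLs that appear in at least one inline citation of the generated article. The function name is hypothetical, and the official scorer may normalise or match URLs differently:

```python
def evidence_coverage(generated: str, evidence_urls: list[str]) -> float:
    """Illustrative sketch: proportion of input evidence URLs cited
    somewhere in the generated article (exact-substring matching here;
    the official scorer may differ)."""
    if not evidence_urls:
        return 0.0
    cited = sum(1 for url in evidence_urls if url in generated)
    return cited / len(evidence_urls)

article = "The claim is false (source: https://www.example1.com)."
# Citing 1 of 2 provided evidence URLs gives a coverage of 0.5.
```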
Submission
The scorer, format checker, and baseline scripts have been released and are available here. Please follow the instructions in the README file for the setup.
Submission Site
UPDATE (24 April 2026): The test phase has started. Submission site: https://www.codabench.org/competitions/15892/.
UPDATE (10 May 2026): The test phase has ended. Please find the leaderboard below.
Submission Guidelines
Test phase submissions on CodaBench: https://www.codabench.org/competitions/15892/
- You may make an unlimited number of submissions during the test phase; however, only the last valid submission per team will be evaluated for the final leaderboard.
- We have provided starter code (see the Files tab on CodaBench) to help generate the predictions. Generate your predictions in the expected output format, as explained in the Dataset tab, as a JSON file named prediction.json. Create a ZIP containing only prediction.json and submit this ZIP for scoring.
- We urge participants to run the format checker on their submissions to ensure they are valid. If you run into any unforeseen issues, please contact dhruv.sahnan@mbzuai.ac.ae. The input and output formats are explained in the Dataset tab.
- Additionally, CodaBench's submission portal will verify that your submission follows the expected output format. If it fails this check, the submission will be discarded.
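The packaging steps above can be sketched as follows. The sample datapoint is a placeholder, and a top-level JSON list is assumed here; the format checker is authoritative on the exact container:

```python
import json
import zipfile

# Placeholder predictions in the expected per-datapoint format
# (assuming a top-level JSON list; verify with the format checker).
predictions = [
    {
        "id": 42,
        "factchecking_article": (
            "Generated article text with inline citations "
            "(source: https://www.example1.com)."
        ),
    },
]

# Write prediction.json, then zip only that file, as the guidelines require.
with open("prediction.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)

with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("prediction.json")
```

The resulting submission.zip contains a single prediction.json entry at the archive root, which is the layout the scoring portal expects.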
Leaderboard
UPDATE (24 April 2026): The test phase has started. Submission site: https://www.codabench.org/competitions/15892/. The leaderboard is hidden at the moment, and will be made public at the end of the test phase.
UPDATE (10 May 2026): The test phase has ended. The leaderboard for all the teams that participated is presented below:
| Team Name | Mean Entailment Score | Mean Citation Precision | Mean Citation Recall | Mean Evidence Coverage | Mean Score |
|---|---|---|---|---|---|
| 🥇 Outsider | 0.278 | 0.671 | 0.671 | 0.563 | 0.546 |
| 🥈 UTS | 0.341 | 0.299 | 0.299 | 0.996 | 0.484 |
| 🥉 DS GT CheckThat | 0.295 | 0.240 | 0.276 | 0.902 | 0.428 |
| Facty | 0.229 | 0.231 | 0.254 | 0.995 | 0.427 |
| UCPH RMT | 0.364 | 0.144 | 0.174 | 0.993 | 0.419 |
| Ro | 0.321 | 0.463 | 0.463 | 0.343 | 0.398 |
| Sourceminds | 0.245 | 0.337 | 0.339 | 0.394 | 0.329 |
| CheckMate | 0.221 | 0.350 | 0.350 | 0.391 | 0.328 |
| KNU Fact | 0.227 | 0.287 | 0.295 | 0.440 | 0.312 |
| Baseline | 0.298 | 0.223 | 0.240 | 0.329 | 0.272 |
| Shtian (not registered in CLEF) | 0.279 | 0.194 | 0.198 | 0.210 | 0.220 |
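The Mean Score column is the arithmetic mean of the four metric columns. As a sanity check, the winning row reproduces its reported mean (to rounding):

```python
# Leaderboard values for the top-ranked team, taken from the table above.
outsider = {
    "entailment": 0.278,
    "citation_precision": 0.671,
    "citation_recall": 0.671,
    "evidence_coverage": 0.563,
}

# Mean of the four metrics; matches the reported 0.546 to three decimals.
mean_score = sum(outsider.values()) / len(outsider)
```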
UPDATE (12 May 2026): Additional late submissions or submissions outside of the CodaBench competition:
| Team Name | Mean Entailment Score | Mean Citation Precision | Mean Citation Recall | Mean Evidence Coverage | Mean Score |
|---|---|---|---|---|---|
| Team AKSH | 0.329 | 0.301 | 0.318 | 0.564 | 0.378 |
Note:
- Team names are as provided by the participants in the CLEF registration form, not taken from the CodaBench submissions. If, as a participant, you do not see your team listed on this leaderboard, or you believe there is a mistake, please reach out at dhruv.sahnan@mbzuai.ac.ae.
- Evaluation results for each team have also been made public and are available on the CheckThat! Lab GitLab repository here.
Organizers
- Dhruv Sahnan, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
- Tanmoy Chakraborty, Indian Institute of Technology - Delhi, India
- Preslav Nakov, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
Contact dhruv.sahnan@mbzuai.ac.ae for any questions.