CheckThat! Lab at CLEF 2026

Task 2: Fact-Checking Numerical Claims

Definition

This task involves verifying naturally occurring claims containing numerical quantities and temporal expressions by improving the reasoning process of Large Language Models (LLMs) through test-time scaling. In contrast to previous editions, which focused solely on fact-checking accuracy, this year's task integrates rationale generation into the evaluation, assessing both the correctness of the veracity prediction and the quality of the model's reasoning. Although the claims to be fact-checked are drawn from the previous edition, the task setup differs: we propose a test-time scaling setup to improve LLM reasoning for claim verification. Each claim is provided with its top-10 associated evidence passages and a set of candidate reasoning paths generated in advance. Given these data, participants are expected to train a verifier model that ranks the reasoning paths for each test claim and outputs the verdict of the top-ranked path. We will also release new evaluation (test) sets to avoid leakage and to allow rigorous evaluation.
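A minimal sketch of the intended inference step is shown below, assuming a trained verifier that scores (claim, evidence, reasoning path) triples; the field names such as "reasoning_paths" and "verdict" are illustrative only and do not reflect the official data schema.

```python
# Sketch of ranking reasoning paths with a verifier and taking the verdict
# from the top-ranked path. Field names are illustrative assumptions.
from typing import Callable, Dict, List

def predict_verdict(
    claim: str,
    evidence: List[str],
    reasoning_paths: List[Dict],          # each: {"text": ..., "verdict": ...}
    verifier_score: Callable[[str, List[str], str], float],
) -> Dict:
    # Score every candidate reasoning path with the verifier.
    scored = [
        (verifier_score(claim, evidence, path["text"]), path)
        for path in reasoning_paths
    ]
    # Rank paths by verifier score, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    ranking = [path for _, path in scored]
    # The final verdict is taken from the top-ranked reasoning path.
    return {"ranking": ranking, "verdict": ranking[0]["verdict"]}
```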

Datasets

The dataset is collected from various fact-checking domains, complete with detailed metadata and an evidence corpus sourced from the web. We will use the English dataset released in QuanTemp, as well as the 3,260 Arabic and 2,808 Spanish claims from the previous edition (CLEF 2025), together with the corresponding evidence collections. An overview of the dataset statistics is shown in the table below. Each claim in the training and validation sets comes with 20 reasoning paths generated by an LLM, which participants can use to train a verification model for selecting the best reasoning path. In addition, the entire evidence corpus, from which the top-10 associated evidence passages are retrieved and re-ranked, will also be provided, so participants can run their own retrieve-and-re-rank pipelines for better performance (a minimal sketch of such a pipeline is given after the table).

Dataset statistics for task 2

Language   # of claims
English    8,000
Spanish    2,808
Arabic     3,260
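
Below is a minimal retrieve-and-re-rank sketch over the released evidence corpus, assuming the rank_bm25 and sentence-transformers packages; the cross-encoder checkpoint named here is an illustrative choice, not a task requirement.

```python
# Two-stage retrieval sketch: BM25 candidate retrieval followed by
# cross-encoder re-ranking. Model choice and candidate depth are assumptions.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

def retrieve_and_rerank(claim: str, corpus: list[str], k: int = 10) -> list[str]:
    # Stage 1: lexical retrieval with BM25 over the whitespace-tokenized corpus.
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    scores = bm25.get_scores(claim.split())
    candidate_ids = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:100]

    # Stage 2: re-rank the BM25 candidates with a cross-encoder, keep the top-k.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    pairs = [(claim, corpus[i]) for i in candidate_ids]
    rerank_scores = reranker.predict(pairs)
    reranked = sorted(zip(candidate_ids, rerank_scores), key=lambda x: x[1], reverse=True)
    return [corpus[i] for i, _ in reranked[:k]]
```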

Evaluation

We will employ Recall@k and MRR@k to evaluate the reasoning paths ranked by the verifier model, treating the reasoning paths that lead to the correct verdict as ground truth. For claim verification, we will use macro-averaged F1 and class-wise F1 scores.
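
The sketch below illustrates how these metrics can be computed, assuming that for each claim the submitted ranking is a list of path ids and the "relevant" set contains the ids of reasoning paths that lead to the correct verdict; the exact ground-truth format will be fixed by the official scorer.

```python
# Illustrative metric computations; input formats are assumptions.
from sklearn.metrics import f1_score

def recall_at_k(ranking: list[int], relevant: set[int], k: int) -> float:
    # Fraction of correct reasoning paths found in the top-k of the ranking.
    return len(set(ranking[:k]) & relevant) / len(relevant) if relevant else 0.0

def mrr_at_k(ranking: list[int], relevant: set[int], k: int) -> float:
    # Reciprocal rank of the first correct reasoning path within the top-k.
    for rank, path_id in enumerate(ranking[:k], start=1):
        if path_id in relevant:
            return 1.0 / rank
    return 0.0

def verification_f1(gold: list[str], predicted: list[str]) -> dict:
    # Macro-averaged F1 plus per-class F1 over the verdict labels.
    labels = sorted(set(gold))
    return {
        "macro_f1": f1_score(gold, predicted, average="macro"),
        "classwise_f1": dict(zip(labels, f1_score(gold, predicted, average=None, labels=labels))),
    }
```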

Submission

Scorer, Format Checker, and Baseline Scripts

TBA

Submission Site

TBA

Submission Guidelines

TBA

Leaderboard

TBA

Organizers

TBA

Contact

TBA