CheckThat! Lab at CLEF 2026

Task 2: Fact-Checking Numerical Claims

Definition

This task involves verifying naturally occurring claims containing numerical quantities and temporal expressions by improving the reasoning process of Large Language Models (LLMs) through test-time scaling. In contrast to previous editions, which focused solely on fact-checking accuracy, this year's task integrates rationale generation into the evaluation, assessing both the correctness of the veracity prediction and the quality of the model's reasoning. Although the claims to be fact-checked are drawn from the previous edition, the task setup differs: we propose a test-time scaling setup to improve LLM reasoning for claim verification. Each claim is provided with its top-10 associated evidence passages and a set of candidate reasoning paths generated in advance. Given these data, participants are expected to train a verifier model that ranks the reasoning paths for each test claim and outputs the verdict of the top-ranked path. We will also release new evaluation (test) sets to avoid leakage and to allow rigorous evaluation.
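A minimal sketch of the intended inference step is shown below, assuming a trained verifier that scores (claim, evidence, reasoning path) triples; the field names such as "reasoning_paths" and "verdict" are illustrative only and do not reflect the official data schema.

```python
# Sketch of ranking reasoning paths with a verifier and taking the verdict
# from the top-ranked path. Field names are illustrative assumptions.
from typing import Callable, Dict, List

def predict_verdict(
    claim: str,
    evidence: List[str],
    reasoning_paths: List[Dict],          # each: {"text": ..., "verdict": ...}
    verifier_score: Callable[[str, List[str], str], float],
) -> Dict:
    # Score every candidate reasoning path with the verifier.
    scored = [
        (verifier_score(claim, evidence, path["text"]), path)
        for path in reasoning_paths
    ]
    # Rank paths by verifier score, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    ranking = [path for _, path in scored]
    # The final verdict is taken from the top-ranked reasoning path.
    return {"ranking": ranking, "verdict": ranking[0]["verdict"]}
```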

Datasets

The dataset is collected from various fact-checking domains, complete with detailed metadata and an evidence corpus sourced from the web. We will use the English dataset released in QuanTemp, as well as the 3,260 Arabic and 2,808 Spanish claims from the previous edition (CLEF 2025), together with the corresponding evidence collections. An overview of the dataset statistics is shown in the table below. Each claim in the training and validation sets comes with 20 reasoning paths generated by an LLM, which participants can use to train a verification model for selecting the best reasoning path. In addition, the entire evidence corpus, from which the top-10 associated evidence passages are retrieved and re-ranked, will also be provided, so participants can run their own retrieve-and-re-rank pipelines for better performance (a minimal sketch of such a pipeline is given after the table).

Dataset statistics for task 2

Language   # of claims
English    8,000
Spanish    2,808
Arabic     3,260
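
Below is a minimal retrieve-and-re-rank sketch over the released evidence corpus, assuming the rank_bm25 and sentence-transformers packages; the cross-encoder checkpoint named here is an illustrative choice, not a task requirement.

```python
# Two-stage retrieval sketch: BM25 candidate retrieval followed by
# cross-encoder re-ranking. Model choice and candidate depth are assumptions.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

def retrieve_and_rerank(claim: str, corpus: list[str], k: int = 10) -> list[str]:
    # Stage 1: lexical retrieval with BM25 over the whitespace-tokenized corpus.
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    scores = bm25.get_scores(claim.split())
    candidate_ids = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:100]

    # Stage 2: re-rank the BM25 candidates with a cross-encoder, keep the top-k.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    pairs = [(claim, corpus[i]) for i in candidate_ids]
    rerank_scores = reranker.predict(pairs)
    reranked = sorted(zip(candidate_ids, rerank_scores), key=lambda x: x[1], reverse=True)
    return [corpus[i] for i, _ in reranked[:k]]
```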

Evaluation

We will employ Recall@k and MRR@k to evaluate the reasoning paths ranked by the verifier model, treating the reasoning paths that lead to the correct verdict as ground truth. For claim verification, we will use macro-averaged F1 and class-wise F1 scores.
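
The sketch below illustrates how these metrics can be computed, assuming that for each claim the submitted ranking is a list of path ids and the "relevant" set contains the ids of reasoning paths that lead to the correct verdict; the exact ground-truth format will be fixed by the official scorer.

```python
# Illustrative metric computations; input formats are assumptions.
from sklearn.metrics import f1_score

def recall_at_k(ranking: list[int], relevant: set[int], k: int) -> float:
    # Fraction of correct reasoning paths found in the top-k of the ranking.
    return len(set(ranking[:k]) & relevant) / len(relevant) if relevant else 0.0

def mrr_at_k(ranking: list[int], relevant: set[int], k: int) -> float:
    # Reciprocal rank of the first correct reasoning path within the top-k.
    for rank, path_id in enumerate(ranking[:k], start=1):
        if path_id in relevant:
            return 1.0 / rank
    return 0.0

def verification_f1(gold: list[str], predicted: list[str]) -> dict:
    # Macro-averaged F1 plus per-class F1 over the verdict labels.
    labels = sorted(set(gold))
    return {
        "macro_f1": f1_score(gold, predicted, average="macro"),
        "classwise_f1": dict(zip(labels, f1_score(gold, predicted, average=None, labels=labels))),
    }
```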

Submission

Scorer, Format Checker, and Baseline Scripts

TBA

Submission Site

TBA

Submission Guidelines

TBA

Leaderboard

TBA

Organizers

TBA

Contact

TBA