Definition
This task performs test-time scaling for fact verification. Given a claim, relevant evidence, and reasoning traces with corresponding verdicts, the task is to rank the reasoning traces by their utility in leading to the correct verdict, and to output the final verdict from the top-ranked reasoning traces. Claims are available in English, Spanish, and Arabic. The task focuses on verifying naturally occurring claims that contain numerical quantities and/or temporal expressions by improving the reasoning capabilities of Large Language Models (LLMs) through test-time scaling. Unlike previous editions, which focused primarily on fact-checking accuracy, this year's task explicitly incorporates rationale generation into the evaluation: reasoning traces produced by an LLM are re-ranked, and both the correctness of the predicted veracity and the quality of the underlying reasoning traces are assessed.

While the claims are drawn from the previous iteration of the task, the overall setup differs substantially. Specifically, we introduce a test-time scaling framework designed to enhance LLM reasoning for claim verification. Given a claim and its associated evidence, multiple reasoning traces are generated using an LLM with varying temperature values to induce diversity. Redundant traces are subsequently removed through a de-duplication step. Based on this data, participants are required to train a verifier model that ranks the reasoning traces for each test claim and derives a final verdict from the top-ranked traces. To avoid leakage and ensure rigorous evaluation, we will release new test sets.
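The de-duplication and rank-then-vote steps above can be sketched in a few lines. This is a minimal illustration, not the official pipeline: `Trace`, `dedup`, and `rank_and_verdict` are hypothetical names, duplicates are detected by hashing whitespace-normalized text (the task may use a different similarity criterion), and the verifier is passed in as a plain scoring function standing in for the model participants would train.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Trace:
    """One LLM reasoning trace and the verdict it argues for."""
    text: str
    verdict: str        # e.g. "SUPPORTED" / "REFUTED"
    score: float = 0.0  # filled in by the verifier


def dedup(traces):
    """Drop exact duplicates after lowercasing and whitespace normalization.

    (Illustrative only; a real system might use semantic similarity instead.)
    """
    seen, unique = set(), []
    for t in traces:
        key = hashlib.sha1(" ".join(t.text.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique


def rank_and_verdict(traces, verifier, k=3):
    """Score each trace with the verifier, rank, and majority-vote over top-k."""
    for t in traces:
        t.score = verifier(t.text)
    ranked = sorted(traces, key=lambda t: t.score, reverse=True)
    top = ranked[:k]
    # Final verdict = most common verdict among the top-ranked traces.
    verdict = max({t.verdict for t in top},
                  key=lambda v: sum(t.verdict == v for t in top))
    return ranked, verdict
```

In practice the `traces` list would come from sampling the same prompt several times at different temperatures, and `verifier` would be the trained ranking model; the vote over the top-k traces is just one simple way to derive the final verdict.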