Definition
This task involves verifying naturally occurring claims that contain numerical quantities and temporal expressions by improving the reasoning process of Large Language Models (LLMs) through test-time scaling. In contrast to previous editions, which focused solely on fact-checking accuracy, this year's task integrates rationale generation into the evaluation, assessing both the correctness of the veracity prediction and the quality of the model's reasoning. Although the claims to be fact-checked are drawn from the previous edition, the task setup differs from the last iteration: we propose a test-time scaling setup to improve LLM reasoning for claim verification. For each claim, the associated top-10 evidence snippets and a set of candidate reasoning paths are provided as input. Given this data, participants are expected to train a verifier model that ranks the reasoning paths for test claims and outputs the verdict of the top-ranked path. We will also release new evaluation (test) sets to avoid leakage and to enable rigorous evaluation.
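To make the expected pipeline concrete, the sketch below shows, in Python, one way the inference step could look once a verifier has been trained: score every candidate reasoning path for a claim and return the verdict of the top-ranked path. All names here (ReasoningPath, Verifier, verify_claim) and the example label set are illustrative assumptions, not part of the official task release.

```python
# Minimal sketch of the verifier-based inference step, assuming a trained
# model that assigns a plausibility score to each reasoning path.
from dataclasses import dataclass
from typing import List


@dataclass
class ReasoningPath:
    """One candidate chain of reasoning over the claim and its evidence."""
    steps: List[str]   # intermediate reasoning steps
    verdict: str       # final label predicted by this path (label set as defined by the task)


class Verifier:
    """Placeholder for a participant-trained model that scores a reasoning path."""

    def score(self, claim: str, evidence: List[str], path: ReasoningPath) -> float:
        # In practice: encode (claim, evidence, reasoning path) and return a scalar score.
        raise NotImplementedError


def verify_claim(claim: str,
                 evidence: List[str],
                 paths: List[ReasoningPath],
                 verifier: Verifier) -> str:
    """Rank the candidate reasoning paths and return the verdict of the top-ranked one."""
    ranked = sorted(paths,
                    key=lambda p: verifier.score(claim, evidence, p),
                    reverse=True)
    return ranked[0].verdict
```

Participants are free to implement the verifier however they like (e.g., a cross-encoder over claim, evidence, and path); the only fixed contract in this sketch is that ranking the provided reasoning paths determines which verdict is reported for each test claim.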