Task 6: Robustness of Credibility Assessment with Adversarial Examples (InCrediblAE)
Many social media platforms employ machine learning for content filtering, attempting to detect content that is misleading, harmful, or simply illegal. For example, imagine the following message has been classified as harmful misinformation:
Water causes death! 100%! Stop drinking now! #NoWaterForMe
However, would it also be stopped if we changed ‘causes’ to ‘is responsible for’? Or ‘cuases’, ‘caυses’ or ‘Causes’? Will the classifier maintain its accurate response? How many changes do we need to make to trick it into changing the decision? This is what we aim to find out in the shared task.
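The perturbations mentioned above are easy to generate programmatically. The following is a minimal sketch, not part of the task software: the helper names (`swap_adjacent`, `replace_homoglyphs`, `variants`) and the one-entry homoglyph table are illustrative assumptions.

```python
# Hypothetical helpers illustrating the kinds of small edits described above.
# Illustrative mapping: Latin 'u' -> Greek small letter upsilon.
HOMOGLYPHS = {"u": "\u03c5"}


def swap_adjacent(word: str, i: int) -> str:
    """Swap the characters at positions i and i+1 (a simple typo)."""
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def replace_homoglyphs(word: str) -> str:
    """Replace characters with visually similar ones from HOMOGLYPHS."""
    return "".join(HOMOGLYPHS.get(c, c) for c in word)


def variants(word: str) -> list[str]:
    """Return a typo, a homoglyph and a capitalised version of the word."""
    return [swap_adjacent(word, 1), replace_homoglyphs(word), word.capitalize()]


text = "Water causes death! 100%! Stop drinking now! #NoWaterForMe"
for variant in variants("causes"):
    # Apply each variant to the first occurrence of the target word.
    print(text.replace("causes", variant, 1))
```

Each printed line is a candidate adversarial example; whether it actually changes the classifier's decision can only be established by querying the victim.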
Definition
The goal of the task is to verify the robustness of popular text classification approaches applied to credibility assessment problems. For each problem domain (e.g., fact checking), participants will be provided with:
- Three victim classifiers,
- An attack dataset.
The participants’ aim will be to create ‘adversarial examples’ by making small modifications to each text snippet in the attack dataset that change the victim classifier’s decision without altering the text meaning.
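To make the idea concrete, here is a minimal sketch of a query-based attack loop. It is not the BODEGA API: `toy_victim` is a stand-in keyword classifier, and `candidate_edits` and `attack` are hypothetical names chosen for illustration. The loop tries one small edit at a time and stops as soon as the victim's decision changes, counting the queries it makes along the way.

```python
from typing import Callable, List, Optional, Tuple


def toy_victim(text: str) -> int:
    """Stand-in classifier (assumption): 1 if the exact token 'causes' appears."""
    return int("causes" in text.split())


def candidate_edits(text: str) -> List[str]:
    """All texts obtained by swapping two adjacent characters inside one word."""
    words = text.split()
    candidates = []
    for w_idx, word in enumerate(words):
        for i in range(len(word) - 1):
            if word[i] == word[i + 1]:
                continue  # swapping identical characters changes nothing
            changed = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            candidates.append(" ".join(words[:w_idx] + [changed] + words[w_idx + 1:]))
    return candidates


def attack(text: str, victim: Callable[[str], int]) -> Tuple[Optional[str], int]:
    """Return (adversarial example or None, number of victim queries used)."""
    original_label = victim(text)
    queries = 1  # the query for the unmodified text
    for candidate in candidate_edits(text):
        queries += 1
        if victim(candidate) != original_label:
            return candidate, queries
    return None, queries


adversarial, used_queries = attack("Water causes death!", toy_victim)
print(adversarial, used_queries)
```

A real attack would also need to check that the meaning is preserved, which this sketch does not attempt.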
The task will include the following problem domains, formulated as binary classification tasks:
- Style-based news bias assessment
- Propaganda detection
- Fact checking
- Rumour detection
- COVID-19 misinformation detection
For each problem domain, three victim classifiers were trained:
- Fine-tuned BERT,
- BiLSTM,
- Surprise classifier (available in the test phase, trained to improve robustness).
Evaluation
The evaluation will verify whether the generated adversarial examples change the victims’ decision and preserve the original meaning. It will consist of two stages:
- The automatic evaluation will be based on automatic measures of adversarial example quality (see below).
- The manual evaluation will use human judgement to assess the similarity. Each task participant will be assigned a random portion of examples and will decide to what degree they preserve the meaning.
Automatic evaluation is implemented as the BODEGA score, which is the product of three numbers computed for each adversarial example:
- Confusion score (1: victim classifier changed the decision, 0: otherwise)
- Semantic score (BLEURT similarity between the original and adversarial example, clipped to 0-1)
- Character score (Levenshtein distance scaled as 0-1 similarity score).
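The product of the three components can be sketched as follows. This is an illustration under stated assumptions, not the official implementation: the Levenshtein distance is assumed to be normalised by the length of the longer text, and the semantic similarity is passed in as a number rather than computed with BLEURT. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_score(original: str, adversarial: str) -> float:
    """Similarity in [0, 1]; assumes normalisation by the longer text's length."""
    longest = max(len(original), len(adversarial))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(original, adversarial) / longest


def bodega_score(victim_changed: bool, semantic_similarity: float,
                 original: str, adversarial: str) -> float:
    """Product of confusion, semantic and character scores for one example."""
    confusion = 1.0 if victim_changed else 0.0
    semantic = min(1.0, max(0.0, semantic_similarity))  # clip to 0-1
    return confusion * semantic * character_score(original, adversarial)


# The semantic similarity of 0.9 is a made-up value for illustration.
print(bodega_score(True, 0.9, "causes", "cuases"))
```

Because the score is a product, an example that fails to change the victim's decision scores 0 no matter how well it preserves the text.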
The final ranking will be based on the BODEGA scores, averaged first over all examples in the dataset, and then over the five domains and three victims. The number of queries to victim models needed to generate each example will not influence the ranking but will be included in the analysis.
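The two-level averaging can be sketched as below. The dictionary layout, the scenario keys and the function name `final_score` are assumptions made for illustration, and the numbers are invented.

```python
from statistics import mean


def final_score(scores: dict[tuple[str, str], list[float]]) -> float:
    """Average per-example scores within each (domain, victim) scenario,
    then average the scenario means."""
    per_scenario = [mean(example_scores) for example_scores in scores.values()]
    return mean(per_scenario)


# Toy input with two scenarios of different sizes (made-up values).
toy_scores = {
    ("fact checking", "BERT"): [1.0, 0.0, 0.0, 0.0],
    ("fact checking", "BiLSTM"): [0.75],
}
print(final_score(toy_scores))
```

Averaging within each scenario first means every domain and victim carries equal weight, regardless of how many examples its attack dataset contains.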
Manual evaluation: All participants who submit entries will have their submissions manually evaluated for semantic similarity. To aid in this human evaluation process, participants in the shared task will be required to manually assess a selection of the submitted examples.
Manual Scoring: We will gather assessments regarding the semantic similarities between attack samples and the original samples. Participants in the shared task are requested to dedicate approximately 8 hours to this manual evaluation, which will be conducted using an online tool. Judges will rate the sample pairs based on the following scale: 4 (meanings are identical), 3 (meanings are slightly different), 2 (meanings are completely different), 1 (meanings are opposite). Detailed guidelines for annotation will be released shortly.
Software
The software framework for evaluating a solution is made available as a Colab notebook at the following addresses:
https://colab.research.google.com/drive/1zxjwiztRLILFUjw5jR5xyL398bNSx8TI?usp=sharing (new notebook containing code for surprise classifiers)
https://colab.research.google.com/drive/1rA7Nwmwfz5OPcmrv4fve6C38Nc9hbLmy?usp=sharing (old notebook)
The notebook is based on the BODEGA framework and covers all the necessary steps: downloading the files with data and victim models, performing an attack, and producing submission files. You just need to replace the MyAttacker class with your own attack.
The data used in the task are available in the CheckThat! repository at https://gitlab.com/checkthat_lab/clef2024-checkthat-lab/-/tree/main/task6?ref_type=heads, which also includes a copy of the notebook file: https://gitlab.com/checkthat_lab/clef2024-checkthat-lab/-/blob/main/task6/IncrediblAE.ipynb?ref_type=heads.
Recommended Reading
The best starting point is the preprint introducing the BODEGA framework, on which the shared task is based. You might also want to look at the survey of adversarial attacks on NLP classification.
Leaderboard
| Rank | Team | Avg. BODEGA score |
| 1 | OpenFact | 0.7458 |
| 2 | TextTrojaners | 0.7074 |
| 3 | TurQUaz | 0.4859 |
| 4 | Plagöri | 0.4776 |
| 5 | MMU_NLP | 0.3848 |
| 6 | SINAI | 0.3507 |
Submission
Please submit your system outputs through this form: https://forms.gle/2Q5GhtPQrZmRcjkr9
A submission should consist of a zip archive with 15 files containing the adversarial examples for each of the scenarios (3 victims × 5 domains). More details on the submission format and source code to produce the files can be found in the task Colab notebook.
Organisers
- Piotr Przybyła, Marie Skłodowska-Curie Fellow, Universitat Pompeu Fabra, Spain
- Xingyi Song, Academic Fellow, University of Sheffield, UK
- Alexander Shvets, Post-Doctoral Researcher, Universitat Pompeu Fabra, Spain
- Horacio Saggion, Chair in Computer Science and Artificial Intelligence and Head of the LaSTUS Lab in the TALN-DTIC, Universitat Pompeu Fabra, Spain
- Yida Mu, Post-Doctoral Researcher, University of Sheffield, UK
- Ben Wu, PhD Student, University of Sheffield, UK
- Kim Cheng Sheang, PhD Student, Universitat Pompeu Fabra, Spain