Information

The course registration period is open. Register for semester 1 courses before Monday, 16 June at 13:00.

Colloquium credits

Presentation Master's thesis - Lucca Pfründer - Clinical Psychology

Last modified on 11-06-2025 14:37
Misleading Deception Classifiers With Model-Based and Human Paraphrasing Attacks
Start date
17-06-2025 15:30
End date
17-06-2025 16:30
Location

Roeterseilandcampus - Building A, Street: Nieuwe Achtergracht 129-B, Room: A2.08. Due to limited room capacity, admission is on a first-come, first-served basis. Teachers must adhere to this as well.

Automated models often outperform humans at detecting deception but remain vulnerable to adversarial attacks: subtle alterations of statements (i.e., changes to words or phrases) that preserve meaning but change the model's classification. After a DistilBERT classifier was trained on 80% of the statements from a dataset of autobiographical truths and lies (Hippocorpus), humans and GPT-4o each rewrote 153 statements from the remaining 20% (the test set) up to 10 times, attempting to flip the model's prediction. This can be understood as a paraphrasing attack: the statement is rewritten so that its meaning stays the same, but in a way intended to fool the classifier. Nearly 70% of paraphrased statements succeeded in changing the model's prediction (i.e., from lie to truth or from truth to lie). While humans and the LLM were similarly effective and efficient overall, humans induced a greater change in model confidence for truthful statements and did so in fewer iterations. This highlights a key vulnerability: models can be tricked by benign rewordings that leave the underlying content unchanged.
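
For illustration, the attack loop described in the abstract can be sketched as follows. This is a minimal sketch, not the author's actual pipeline: the checkpoint name "distilbert-lie-detector", the assumed label order (0 = lie, 1 = truth), and the paraphrase callable (standing in for the human or GPT-4o rewriting step) are all assumptions introduced here.

```python
# Minimal sketch of the paraphrasing-attack loop described in the abstract.
# Assumptions (not from the original text): a fine-tuned DistilBERT
# checkpoint saved locally as "distilbert-lie-detector", label order
# 0 = lie / 1 = truth, and a `paraphrase` callable that stands in for the
# human or GPT-4o rewriting step.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-lie-detector")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-lie-detector")
model.eval()


def classify(statement: str) -> torch.Tensor:
    """Return the class probabilities for a single statement."""
    inputs = tokenizer(statement, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).squeeze(0)


def paraphrasing_attack(statement: str, paraphrase, max_iterations: int = 10) -> dict:
    """Rewrite a statement up to `max_iterations` times, stopping as soon as
    the classifier's predicted label flips. Also tracks how much the
    probability assigned to the original label drops (the change in model
    confidence)."""
    original_probs = classify(statement)
    original_label = int(original_probs.argmax())

    current = statement
    for i in range(1, max_iterations + 1):
        current = paraphrase(current)  # meaning-preserving rewrite
        probs = classify(current)
        if int(probs.argmax()) != original_label:
            return {
                "flipped": True,
                "iterations": i,
                "statement": current,
                "confidence_drop": float(
                    original_probs[original_label] - probs[original_label]
                ),
            }
    return {"flipped": False, "iterations": max_iterations, "statement": current}
```

In the study, the rewriting step corresponded to either a human or a GPT-4o prompt, and an attack counts as successful if the prediction flips within 10 rewrites; the sketch above mirrors that setup under the stated assumptions.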