student.uva.nl
Colloquium credits

Presentation Master's thesis - Isa Monkau - Psychological Methods

Last modified on 27-01-2026 14:11
Jailbreak Detection in Specialized AI: A Comparative Study of Domain-Specific and Generic Approaches
Start date
12-02-2026 11:00
End date
12-02-2026 10:00
Location

Roeterseilandcampus - Building G, Street: Nieuwe Achtergracht 129-B, Room: GS.02. Because room capacity is limited, attendance is on a first-come, first-served basis. Teachers must adhere to this.

Jailbreak attacks pose critical security risks for Large Language Models (LLMs), particularly in high-stakes domains like mental healthcare. While various jailbreak detection systems have been developed, they have relied on generalized datasets, overlooking the unique vulnerabilities of specialized applications. To address this gap, we present a comparative evaluation of domain-specific versus generic jailbreak detection for a mental healthcare AI assistant. We developed a tailored risk taxonomy and curated 250 domain-specific jailbreak prompts across five threat categories. Red-teaming analysis showed that 30% of these prompts bypassed the target model's defenses. We then compared three detection approaches: a custom domain-specific classifier (DSC), a generalized control classifier (GCC), and Meta's PromptGuard 2. The DSC achieved perfect classification on domain-specific test data, significantly outperforming both generic approaches. Our findings demonstrate that domain-specific training provides measurable detection advantages in specialized contexts. This study offers practical insights for organizations deploying AI in sensitive domains, highlighting the value of context-aware security measures.