Research
We build and support multidisciplinary teams focused on:
Interpretability: Making the decision-making processes of AI systems transparent and understandable to humans
Robustness: Keeping AI systems resilient to adversarial attacks and novel environments
Governance: Examining the frameworks, policies, and regulations that guide AI development and deployment
Select Publications by AISI Members
AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Anthropic Science Blog · 2026
Abhay Sheshadri
An Independent Safety Evaluation of Kimi K2.5
Preprint · 2026
Parv Mahajan
Alignment Faking Revisited: Improved Classifiers and Extensions
Anthropic Science Blog · 2025
Abhay Sheshadri
Patterns and Mechanisms of Contrastive Activation Engineering
ICLR Workshop · 2025
Yixiong Hao, Ayush Panda, Stepan Shabalin
Scaling Sparse Feature Circuits For Studying In-Context Learning
ICLR Workshop · 2025
Stepan Shabalin
Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning
CVPR Workshop · 2025
Stepan Shabalin, Ayush Panda, Yixiong Hao
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
arXiv preprint · 2024
Abhay Sheshadri
Robust Unlearning via Mechanistic Localizations
ICLR Workshop Spotlight · 2024
Abhay Sheshadri
A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task
ACL Main Conference · 2024
Abhay Sheshadri
A full publication list can be found on our research tracker.
Current Projects
Jailbreaking vision-language models
We examine how fine-tuning vision-language models (VLMs) for downstream tasks increases their vulnerability to adversarial attacks and jailbreaking across text-only, image-only, and multimodal attack vectors. We investigate which downstream tasks create larger attack surfaces, identify failure modes that appear only after fine-tuning, and explore the connection between representational drift and adversarial robustness. The goal is to develop alignment steering approaches that recover robustness without sacrificing task performance.
Investigating the bystander effect in multi-agent systems
We empirically investigate whether LLMs exhibit social psychology phenomena such as the bystander effect, conformity, and diffusion of responsibility when deployed in multi-agent scenarios with ethical dimensions. We aim to understand collective dynamics that go beyond the behavior of any individual LLM, focusing in particular on responsibility attribution and on moral judgment failures that could cause harm in high-stakes applications.