Research

We build and support multidisciplinary teams focused on:

  • Interpretability: Making the decision-making processes of AI systems transparent and understandable to humans

  • Robustness: Ensuring AI systems remain resilient under adversarial attacks and in novel environments

  • Governance: Examining the frameworks, policies, and regulations that guide AI development and deployment

Contribute to our research »

Select Publications by AISI members

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Anthropic Science Blog · 2026
Abhay Sheshadri

An Independent Safety Evaluation of Kimi K2.5
Preprint · 2026
Parv Mahajan

Alignment Faking Revisited: Improved Classifiers and Extensions
Anthropic Science Blog · 2025
Abhay Sheshadri

Patterns and Mechanisms of Contrastive Activation Engineering
ICLR Workshop · 2025
Yixiong Hao, Ayush Panda, Stepan Shabalin

Scaling Sparse Feature Circuits For Studying In-Context Learning
ICLR Workshop · 2025
Stepan Shabalin

Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning
CVPR Workshop · 2025
Stepan Shabalin, Ayush Panda, Yixiong Hao

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
arXiv preprint · 2024
Abhay Sheshadri

Robust Unlearning via Mechanistic Localizations
ICLR Workshop Spotlight · 2024
Abhay Sheshadri

A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task
ACL Main Conference · 2024
Abhay Sheshadri

A full publication list can be found on our research tracker.

Current Projects

Jailbreaking vision-language models

We examine how fine-tuning vision-language models (VLMs) for downstream tasks increases their vulnerability to adversarial attacks and jailbreaks across text-only, image-only, and multimodal attack vectors. We investigate which tasks create larger attack surfaces, identify failure modes that are unique to fine-tuned models, and explore the connection between representational drift and adversarial robustness. The goal is to develop alignment steering approaches that recover robustness without sacrificing task performance.

Investigating the bystander effect in multi-agent systems

We empirically investigate whether LLMs exhibit social psychology phenomena such as the bystander effect, conformity, and diffusion of responsibility when deployed in multi-agent scenarios with ethical dimensions. We aim to understand collective dynamics beyond individual LLM behavior, focusing on failures of responsibility attribution and moral judgment that could cause harm in high-stakes applications.