Upcoming Talk with Michael Aird @ RAND!

Research

Our broad purpose is to address emergent risks from advanced AI systems. To this end, we build and support multidisciplinary teams focused on fields including:

Interpretability: Seeks to make the decision-making processes of AI systems transparent and understandable to humans.

Robustness: Pertains to the resilience and stability of AI systems in the face of adversarial attacks, novel inputs, or changing environments.

Governance: Delves into the frameworks, policies, and regulations guiding AI development and deployment.

Select Publications

Alignment Faking Revisited: Improved Classifiers and Extensions | Anthropic Science Blog 2025

Patterns and Mechanisms of Contrastive Activation Engineering | ICLR 2025 Workshops

Scaling Sparse Feature Circuits For Studying In-Context Learning | ICLR 2025 Workshop

Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning | CVPR 2025 Workshop

Robust Unlearning via Mechanistic Localizations | ICLR 2024 Workshop Spotlight

A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task | ACL 2024 Main Conference

A full publication list can be found on our research tracker.

Current Projects

Mitigating AI-fueled catastrophic biorisk in Liberia

In collaboration with the Office of General Counsel of the Ministry of Health of the Republic of Liberia, we develop a regulatory framework aimed at mitigating biosecurity threats from advanced artificial intelligence. We assume minimal enforcement resources and focus on red lines most applicable to catastrophic risks faced by Liberian citizens.

Jailbreaking vision language models

We examine how fine-tuning Vision-Language Models for downstream tasks increases vulnerability to adversarial attacks and jailbreaking across text-only, image-only, and multimodal attack vectors. We investigate which specific tasks create greater attack surfaces, identify unique failure modes post-fine-tuning, and explore connections between representational drift and adversarial robustness. The work aims to develop alignment steering approaches that can recover robustness without sacrificing task performance.

Investigating the bystander effect in multi-agent systems

We empirically investigate whether LLMs exhibit social psychology phenomena like the bystander effect, conformity, and diffusion of responsibility when deployed in multi-agent scenarios with ethical dimensions. We aim to understand collective dynamics beyond individual LLM behavior, particularly focusing on responsibility attribution and moral judgment failures that could cause harm in high-stakes applications. This work seeks to establish methodological foundations for evaluating psychological-like traits in AI agents to inform alignment research.

Comparative study of global deepfake regulation

We extract deepfake-relevant clauses from major policies passed globally and consolidate them into a comprehensive deepfake policy index. We systematically analyze what distinguishes illegal deepfakes from allowable edited content, evaluate how existing policies address various risks, and examine real-world case studies to assess policy interpretation and outcomes. The findings will identify policy convergences and develop recommendations for creating more harmonized and effective deepfake governance frameworks.

Experimenting with automated model diffing

We use model diffing with BatchTopK crosscoders to mechanistically understand how instruction-tuning changes LLM behavior, focusing on how models distinguish instructions from other text, handle conflicting directives, and reject unsafe requests. We investigate internal mechanisms behind instruction recognition, hierarchical processing, rejection decisions, and adversarial robustness to jailbreaking attacks. We seek to provide foundational insights for alignment researchers working with instruction-tuned models while prioritizing safety and robustness.
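For readers unfamiliar with the BatchTopK activation mentioned above, here is a minimal sketch in plain Python. It is an illustration only, not the project's code: the function name `batch_topk`, the list-of-lists input format, and the ReLU applied before selection are all assumptions made for this example. The idea shown is that the sparsity budget (k times the batch size) is shared across the whole batch, so individual samples may keep more or fewer than k active latents.

```python
def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest activations across the whole batch.

    pre_acts: list of rows, one row of latent pre-activations per sample.
    k: average number of active latents allowed per sample.
    (Hypothetical helper for illustration; names are not from the project.)
    """
    batch_size = len(pre_acts)
    budget = k * batch_size

    # Assumption: clamp negatives to zero (ReLU) before selecting.
    flat = [
        (max(value, 0.0), row_idx, col_idx)
        for row_idx, row in enumerate(pre_acts)
        for col_idx, value in enumerate(row)
    ]

    # Sort by activation, largest first; break ties by position so the
    # result is deterministic.
    kept = sorted(flat, key=lambda t: (-t[0], t[1], t[2]))[:budget]

    # Everything outside the budget is zeroed out.
    out = [[0.0] * len(row) for row in pre_acts]
    for value, row_idx, col_idx in kept:
        out[row_idx][col_idx] = value
    return out


# Two samples, three latents each, k = 1, so the batch-wide budget is 2.
acts = [[0.9, 0.1, 0.8],
        [0.2, 0.05, 0.3]]
sparse = batch_topk(acts, k=1)
```

In this toy batch both surviving activations come from the first sample, which is the difference from a per-sample top-k: the second sample ends up with no active latents at all.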

In-training crosscoder analysis of RL agents

We train RL policies using behavior cloning and capture intermediate activations at each gradient step. We then train a sparse crosscoder to reconstruct these activations, with the goal of identifying features that emerge during training, from primitive to complex. We also aim to detect features associated with unsafe trajectories, which could help reduce harm during deployment. Through this work, we hope to provide insights into the efficiency and safety of training deep RL systems and to contribute an open dataset for further research.

Research Opportunities

Join AISI researchers on impactful projects aimed at a workshop or arXiv publication. This opportunity is open to select AI Safety Fellowship graduates and to others by application. To inquire, email board@aisi.dev with your resume/CV and a short description of your interest in AI safety research.

For external opportunities, see Careers.

We are coordinating with College of Computing and ML@GT faculty to mentor promising AISI research groups alongside the Research Option (RO) for undergraduates at GT. Details forthcoming!

