Current Projects
Jailbreaking vision-language models
We examine how fine-tuning vision-language models (VLMs) for downstream tasks increases their vulnerability to adversarial attacks and jailbreaking across text-only, image-only, and multimodal attack vectors. We investigate which downstream tasks create larger attack surfaces, identify failure modes unique to fine-tuned models, and explore the connection between representational drift and adversarial robustness. The goal is to develop alignment steering approaches that recover robustness without sacrificing task performance.
Investigating the bystander effect in multi-agent systems
We empirically investigate whether large language models (LLMs) exhibit social psychology phenomena such as the bystander effect, conformity, and diffusion of responsibility when deployed in multi-agent scenarios with ethical dimensions. We aim to understand collective dynamics beyond the behavior of individual LLMs, focusing on failures of responsibility attribution and moral judgment that could cause harm in high-stakes applications.
I'm not sure which of the other projects are still ongoing. If there are not that many, we can probably lump this page together with previous research. Also, if we have anything we can link, or pictures, that would be nice.