Comparative study of global deepfake regulation
We extract deepfake-relevant clauses from major policies passed globally and consolidate them into a comprehensive deepfake policy index. We systematically analyze what distinguishes illegal deepfakes from permissible edited content, evaluate how existing policies address various risks, and examine real-world case studies to assess how policies are interpreted and what outcomes they produce. Based on these findings, we will identify points of policy convergence and develop recommendations for more harmonized and effective deepfake governance frameworks.
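To make the shape of the policy index concrete, here is a minimal sketch of what a single index entry could look like; the field names and example values are hypothetical placeholders, not a finalized schema.

```python
# Hypothetical sketch of one entry in the deepfake policy index.
# Field names and values are illustrative, not a finalized schema.
from dataclasses import dataclass, field


@dataclass
class PolicyClause:
    jurisdiction: str                                   # e.g. a country or state code
    instrument: str                                     # statute or regulation the clause comes from
    clause_text: str                                    # extracted deepfake-relevant language
    covered_harms: list[str] = field(default_factory=list)   # e.g. electoral, impersonation, fraud
    carve_outs: list[str] = field(default_factory=list)      # e.g. parody, labeled synthetic media
    enforcement: str | None = None                            # penalties or remedies, if specified


# Placeholder entry showing how extracted clauses would be recorded.
example = PolicyClause(
    jurisdiction="Example-Jurisdiction",
    instrument="Example Synthetic Media Act",
    clause_text="<extracted clause text>",
    covered_harms=["impersonation"],
    carve_outs=["satire"],
)
```

Structuring each clause this way lets us compare jurisdictions field by field when looking for convergences and gaps.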
Experimenting with automated model diffing
We use model diffing with BatchTopK crosscoders to mechanistically understand how instruction-tuning changes LLM behavior, focusing on how models distinguish instructions from other text, handle conflicting directives, and reject unsafe requests. We investigate the internal mechanisms behind instruction recognition, hierarchical processing, rejection decisions, and adversarial robustness to jailbreaking attacks. Our aim is to provide foundational insights for alignment researchers working with instruction-tuned models, with safety and robustness as the priority.
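As a rough illustration, the sketch below shows one way a BatchTopK crosscoder for model diffing could be set up: a shared sparse latent space with one encoder/decoder pair per model, and top-k sparsity applied across the whole batch rather than per example. The architecture, dimensions, and training details are assumptions for exposition, not our exact configuration.

```python
# Minimal sketch of a BatchTopK crosscoder for diffing a base model against
# its instruction-tuned counterpart. Shapes and hyperparameters are illustrative.
import torch
import torch.nn as nn


class BatchTopKCrosscoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int, n_models: int = 2, k: int = 32):
        super().__init__()
        self.k = k
        # One encoder and one decoder per model, sharing a single latent space.
        self.W_enc = nn.Parameter(torch.randn(n_models, d_model, d_latent) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(n_models, d_latent, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_latent))
        self.b_dec = nn.Parameter(torch.zeros(n_models, d_model))

    def encode(self, acts: torch.Tensor) -> torch.Tensor:
        # acts: [batch, n_models, d_model]; sum encoder contributions over models.
        pre = torch.relu(torch.einsum("bmd,mdl->bl", acts, self.W_enc) + self.b_enc)
        # BatchTopK sparsity: keep the k * batch_size largest activations
        # across the whole batch, instead of the top k per example.
        n_keep = self.k * pre.shape[0]
        threshold = torch.topk(pre.flatten(), n_keep).values.min()
        return pre * (pre >= threshold)

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        # Reconstruct each model's activations from the shared latents.
        return torch.einsum("bl,mld->bmd", latents, self.W_dec) + self.b_dec

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(acts))


# Usage sketch: paired residual-stream activations from the base and the
# instruction-tuned model at the same layer and token positions (random
# tensors stand in for real activations here).
base_acts = torch.randn(64, 768)
chat_acts = torch.randn(64, 768)
acts = torch.stack([base_acts, chat_acts], dim=1)   # [batch, 2, d_model]

coder = BatchTopKCrosscoder(d_model=768, d_latent=4096, k=32)
recon = coder(acts)
reconstruction_loss = (recon - acts).pow(2).mean()
```

Under this setup, latents whose decoder weights are substantial for only one of the two models are candidates for features introduced or removed by instruction-tuning, which is the kind of model-diffing signal we intend to interpret.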