Research
We host a research program for those who have an advanced understanding of AI safety and strong technical experience.
News: AISI Researchers Selected for Top ML Conferences
We are partnering with EleutherAI, a leading AI research non-profit, to conduct research in AI alignment. This is a very exciting opportunity! Eleuther has become highly respected in the field for their open-source models (e.g., Stable Diffusion, GPT-J, OpenFold) and their publications in top venues (e.g., NeurIPS, ACL, ICLR).
Participants will spend 5-15 hours a week working on their project under the supervision of Curtis Huebner, the head of alignment research at Eleuther. Participants will collaborate in cohorts of about 5 members.
Last call for applications! Program start dates will be adjusted to accommodate participant schedules. We expect projects to begin in mid-May and wrap up in mid-August.
[Application] This program will be conducted remotely. In the application, participants will indicate their project preference and availability. Based on this information, we will create cohorts of about 5 members working on different projects. Participants will be expected to work at least 5 hours a week.
This program will follow a capstone model. First, participants will put together a project proposal. During the program, they will keep a logbook of their progress, working towards a final report on the project.
Participants will give brief weekly progress reports to program managers. Meetings will alternate between Gaurav Sett, the director of AISI, and Curtis Huebner, the head of alignment research at Eleuther.
We have identified a few project areas that participants may work on.
1. Recreating Neel Nanda’s interpretability library for other frameworks.
Neel Nanda has written the TransformerLens library along with some demonstrations of its utility. However, it is a PyTorch library, and there isn't really an equivalent for Jax. This project would involve creating a similar library. Much of the challenge will come from designing the library well and making it capable of interacting with the specific oddities of the Jax framework.
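To give a flavour of what such a library might do, here is a minimal sketch of the "run with cache" pattern in Jax. The function and hook names are made up for illustration and are not the real TransformerLens API; the point is that Jax's functional style favours returning named activations alongside the output rather than relying on PyTorch-style side-effect hooks.

```python
# A minimal sketch (hypothetical names, not a real library) of a
# TransformerLens-style "run with cache" pattern in Jax: because Jax is
# functional, activations are returned alongside the output rather than
# captured through PyTorch-style side-effect hooks.
import jax
import jax.numpy as jnp

def init_params(key, d_model=16, n_layers=3):
    keys = jax.random.split(key, n_layers)
    return [jax.random.normal(k, (d_model, d_model)) / jnp.sqrt(d_model) for k in keys]

def run_with_cache(params, x):
    """Forward pass through a toy stack of layers, recording every named activation."""
    cache = {}
    h = x
    for i, w in enumerate(params):
        h = jax.nn.gelu(h @ w)
        cache[f"layer_{i}.post_act"] = h  # a "hook point" readable after the fact
    return h, cache

params = init_params(jax.random.PRNGKey(0))
x = jax.random.normal(jax.random.PRNGKey(1), (4, 16))  # a batch of 4 residual-stream vectors
out, cache = run_with_cache(params, x)
print(sorted(cache))                    # layer_0.post_act, layer_1.post_act, layer_2.post_act
print(cache["layer_1.post_act"].shape)  # (4, 16)
```

A real library would also need well-designed activation naming, compatibility with jit/vmap/pmap, and the ability to patch (edit) activations rather than only read them.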
Potential questions:
What should be the scope of this project? Do we want a generic utility for any Jax code, or do we want specific functionality for individual libraries that build on top of it?
Does the specific way that Jax works under the hood pose any particular challenges to implementation?
What should the interface look like?
What are interpretability researchers looking for in an interpretability library?
How should you go about communicating this project to the broader community?
Background reading:
2. Looking at language models through the lens of Markov chains
Language models, at least the public ones, are generally trained with a log-likelihood loss objective, which corresponds to maximising the probability mass assigned to the real data. In the limit, a model that perfectly optimises this objective reproduces the underlying data-generating process. However, the real systems we train are not perfect, even the really big ones. Due to the training objective, we can be reasonably confident that the output of an LM in generative mode will stay reasonably on target, at least for short periods of time. However, we cannot make similar guarantees in the limit of recursively feeding model output back to itself. This project aims to study how language models behave as they begin to dogfood their own output. The project has a very wide potential scope.
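As a concrete toy illustration of the framing (the vocabulary and transition probabilities below are invented purely for this example), one can treat a bigram model as a Markov chain and compare the stationary distribution of an imperfectly learned model against that of the true data-generating chain:

```python
# A toy example: view generation-on-own-output as a Markov chain and ask
# where the long-run (stationary) behaviour of an imperfect model ends up.
# The vocabulary and transition matrices are made up for illustration.
import numpy as np

# "True" data-generating bigram chain: P_data[i, j] = P(next = j | current = i)
P_data = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.4, 0.5]])
# An imperfectly learned model: each row is close to, but not equal to, the truth.
P_model = np.array([[0.5, 0.4, 0.1],
                    [0.3, 0.4, 0.3],
                    [0.1, 0.3, 0.6]])

def stationary(P, iters=1000):
    """Power-iterate a row-stochastic matrix to its stationary distribution."""
    pi = np.ones(P.shape[0]) / P.shape[0]
    for _ in range(iters):
        pi = pi @ P
    return pi

print("data stationary dist: ", np.round(stationary(P_data), 3))
print("model stationary dist:", np.round(stationary(P_model), 3))
```

Even when each learned transition looks close to the truth, small per-step errors compound under repeated self-feeding; characterising when and how badly this drift happens for real LMs is the interesting part.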
Potential questions:
Should the narrow situation where LLMs consume their own output be studied, or should more complicated setups be investigated?
The literature on Markov chains runs deep; are there any theorems we can use to make statements about LM drift?
What can we say about the stationary distribution of LMs consuming their own output?
If you wanted an LM whose stationary distribution matched a given training distribution, how would you go about that? Is the existing training objective sufficient, or do modifications need to be made?
Background reading:
3. Unpaired text-image generation.
AI image generation has undergone a revolution thanks to scaling. Controllable image generation makes for a nice toy box for studying some of the arguably easier parts of AI alignment. Specifically, it gives us a simple example of using the text modality to control another modality. This, for what it is, works extremely well. However, it is accomplished by leveraging the enormous number of image-text pairs present on the internet, which is not generally the case with other modalities. This project aims to investigate techniques for reproducing the capabilities of existing image generators using (mostly) unpaired data.
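As one illustration of the kind of technique that might be evaluated, here is a toy sketch of a cycle-consistency objective in the spirit of CycleGAN-style unpaired translation. Everything in it is a stand-in: linear "generators", random vectors in place of real text and image embeddings, and plain gradient descent; it is not a proposed method.

```python
# A toy sketch of a cycle-consistency objective for unpaired text<->image data.
# All names and shapes here are illustrative stand-ins, not a real model.
import jax
import jax.numpy as jnp

d_text, d_img = 8, 8
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
text_batch = jax.random.normal(k1, (32, d_text))    # unpaired "text" embeddings
image_batch = jax.random.normal(k2, (32, d_img))    # unpaired "image" embeddings
params = {"G": 0.1 * jax.random.normal(k3, (d_text, d_img)),   # text -> image
          "F": 0.1 * jax.random.normal(k4, (d_img, d_text))}   # image -> text

def cycle_loss(params, text, image):
    # Reconstruct text via text -> image -> text, and image via image -> text -> image.
    t_cycle = (text @ params["G"]) @ params["F"]
    i_cycle = (image @ params["F"]) @ params["G"]
    return jnp.mean((t_cycle - text) ** 2) + jnp.mean((i_cycle - image) ** 2)

grad_fn = jax.jit(jax.grad(cycle_loss))
for _ in range(200):
    grads = grad_fn(params, text_batch, image_batch)
    params = jax.tree_util.tree_map(lambda p, g: p - 0.05 * g, params, grads)
print("final cycle loss:", cycle_loss(params, text_batch, image_batch))
```

A real approach would add a distribution-matching term (adversarial, contrastive, or otherwise) so that generated images actually resemble the unpaired image set rather than merely being invertible codes, and this is exactly the sort of design space the project would map out.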
Potential questions:
Has this been done before? What does the landscape of existing techniques look like?
What methods should be evaluated?
Is it possible to get *completely* unpaired text->image generation? What are its limitations?
How are you going to evaluate the quality of text->image generation?
How does model scaling fit into this picture? Do certain techniques start/stop working at different scales?
Background reading:
4. Participant proposals
If you have an idea for a potential project, we'd love to hear it! For ideas, see Eleuther's research agenda.