Developer. Researcher

AlignGPT: GPT-4 for alignment research

I have been working with a small group on a project that was designed by Leon Lang, as described in this post on LessWrong. This post is about 6000 tokens long and would not fit into the context size of gpt-3.5, so I decided to try out gpt-4.


I believe the initial summary by GPT-4 is useful and it would have been nice to have access to this when I first read the post. It found some of the same criticisms that we also had, especially the part about the distributional shift. It suggested some ways to solve the issue like domain adaptation, domain randomization, or adversarial training. Some of the terms were new to me, for example Curriculum learning. So this could definitely be used to get an overview of existing research and useful techniques. In general, the current plan for the project has shifted quite far from the original post and much of this is not really relevant to the new plan. Overall, these tools could help to accelerate the development and implementation of technical solutions in alignment projects:

  • Having a conversational agent than can explain posts and papers
  • Having an automated peer review agent to streamline the process and improve the quality of feedback provided to authors
  • Having a tool that can help with the engineering of alignment. Overall, this tool could help to accelerate the development and implementation of technical solutions in alignment projects, while also improving the quality and reliability of the final product. Existing projects could be scanned for bugs and testing could be added automatically.

GPT-4 Peer review

I used the GPT-4 API to create a paper review helper and got the following response to the post.

This text presents an interesting idea for an experiment that aims to test the convergent drive of modern model-based reinforcement learning (RL) agents to evade shutdownability, a concept related to corrigibility. The author provides a detailed AGI training story, which is divided into three phases: training the RL agent for general world modeling, training the agent to achieve goals and stop moving when it hears a sound, and deploying the agent in the real world where the existential failure appears.

The author also discusses potential experimental details using MuZero, a model-based RL algorithm, and raises several caveats and further thoughts on the proposed experiment. The text concludes by comparing the proposed setup to the original shutdown problem and how it fits into the Shutdown Problem Contest.

Overall, the text is well-structured and presents a thought-provoking idea. However, there are a few areas that could be improved or clarified:

  1. The author acknowledges their lack of professional experience in reinforcement learning, which may affect the feasibility of the proposed experiments. It would be beneficial to consult with experts in the field to refine the experimental design and ensure its feasibility.

  2. The text could benefit from a more thorough literature review to address the author’s uncertainties and to ensure that the proposed experiments are novel and relevant to the field.

  3. The author mentions that they may not be the best person to work on the experiments due to their comparative advantage. It would be helpful to clarify what specific skills or expertise they believe would be necessary to successfully carry out the proposed experiments.

  4. The text could benefit from a more concise presentation of the AGI training story and the technical details of the proposed experiments. Some sections are quite lengthy and could be shortened without losing essential information.

  5. The author could consider discussing alternative RL algorithms or approaches that might be more suitable for the proposed experiments, in case MuZero is not the best choice.

In conclusion, the text presents an intriguing idea for an experiment related to corrigibility and shutdownability in reinforcement learning agents. With some refinements and consultation with experts in the field, this proposal could potentially contribute valuable insights to the machine learning community.

In point 4, I suggested that the presentation of the AGI training story and the technical details of the proposed experiments could be more concise. The concern is not about missing technical details but rather about the length and complexity of the sections describing the training story and the experimental setup.

To address this concern, the author could consider the following:

  1. Summarize the key points of each phase in the AGI training story more concisely, focusing on the most important aspects of the agent’s learning and behavior.

  2. Simplify the presentation of the technical details, such as the state and action spaces, the transition kernel, and the reinforcement learning process. This could involve using more straightforward language, providing clear examples, or using visual aids like diagrams or flowcharts to illustrate the concepts.

  3. Organize the technical details in a more structured manner, using headings and subheadings to guide the reader through the different components of the experimental setup.

By making these adjustments, the author can make the AGI training story and the technical details of the proposed experiments more accessible and easier to understand for readers who may not be familiar with reinforcement learning or the specific algorithms and concepts discussed in the text.

The proposed training process in the text is already quite detailed and well-thought-out. However, here are a few suggestions that could potentially improve the training process:

  1. Curriculum learning: Instead of directly training the agent on complex tasks, consider gradually increasing the difficulty of the tasks during the training process. This approach, known as curriculum learning, can help the agent learn more effectively by building on simpler skills before tackling more challenging problems.

  2. Transfer learning: To speed up the training process and leverage existing knowledge, consider using transfer learning techniques. Pretrain the agent on related tasks or environments and then fine-tune it on the specific tasks of interest. This approach can help the agent generalize better and learn more efficiently.

  3. Exploration strategies: To encourage the agent to explore the environment more effectively, consider using advanced exploration strategies such as intrinsic motivation, curiosity-driven exploration, or entropy regularization. These techniques can help the agent discover novel states and actions, leading to better learning outcomes.

  4. Multi-task learning: Train the agent on multiple tasks simultaneously to encourage the development of general skills and representations that can be applied across different tasks. This approach can help the agent generalize better and potentially improve its ability to handle new tasks during deployment.

  5. Robustness to distributional shift: To ensure that the agent can handle changes in the environment or task distribution during deployment, consider incorporating techniques that promote robustness to distributional shift, such as domain adaptation, domain randomization, or adversarial training.

  6. Human-in-the-loop training: To better align the agent’s behavior with human values and expectations, consider incorporating human feedback or demonstrations into the training process. Techniques such as inverse reinforcement learning, learning from demonstrations, or reward modeling can help the agent learn from human expertise and preferences.

These suggestions can be considered and adapted based on the specific reinforcement learning algorithm, environment, and goals of the proposed experiments. By incorporating some of these techniques, the training process may become more effective, leading to better generalization and corrigibility in the trained agent.