SVD can be applied to the OV circuit and MLP weights of a transformer for interpretability. In LLMs, the resulting singular vectors frequently appear to be semantically interpretable when you take their cosine similarity with the vectors of the output token embedding. This work seeks to extend this line of research from language models to decision transformers. Decision transformers have a similar architecture but use separate action, state, reward, and time embeddings, and there are usually also separate output embeddings for action and state. Just like in the original post from Conjecture, I have looked at the SVD of the OV circuit and of the first (K) and second (V) layers of the MLP. By measuring the cosine similarity of the weights' singular vectors to the different embeddings, it can be shown how different weights attend to different inputs and which outputs they produce. This approach might make it possible to locate where actions originate and where input data is processed in a transformer. To further confirm these results, I edit the weights and observe that the effects are in line with the locations found via SVD: where singular vectors have a strong cosine similarity with the action embeddings, edits to those weights cause the agent to fail, while similar edits to other parts do not. Since this post compares many different embeddings, the embedding vectors have been normalized to make them comparable: all of the results show the inner product of two 128-dimensional unit vectors.

## SVD for Searching the Origin of Actions

The graphic below shows the results of SVD for the K weights, i.e. the weights of the first linear layer in the MLP of a transformer block. Similar to Conjecture's findings in GPT-2, the first singular vector seems to correspond to the most common output actions, and the cosine similarities with the action output embedding are generally high.
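The comparison above can be sketched as follows. The names and shapes here (`K`, `W_act`, a 128-dimensional residual stream, 8 actions) are illustrative placeholders, not the actual model's dimensions:

```python
import numpy as np

# Hypothetical shapes: d_model = 128 (residual size), d_mlp = 512, 8 actions.
rng = np.random.default_rng(0)
K = rng.standard_normal((128, 512))    # stands in for the first MLP layer
W_act = rng.standard_normal((8, 128))  # stands in for the action output embedding

# The left singular vectors of K (columns of U) live in the residual stream.
U, S, Vt = np.linalg.svd(K, full_matrices=False)

# Normalize the embedding rows; the columns of U already have unit norm,
# so the inner product below is exactly the cosine similarity.
W_act = W_act / np.linalg.norm(W_act, axis=1, keepdims=True)
sims = W_act @ U  # (8 actions) x (128 singular vectors)
```

A high entry in `sims` means that singular vector points in the direction of a particular action's output embedding.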

Conversely, the V weights show much lower cosine similarities.

SVD on the K weights of the MLP, compared with the action output embedding, appears to show a causal relationship: the cosine similarity was much greater for the K weights than for any other part of the transformer block. Editing the weights of the transformer MLP by setting one of the singular vectors equal to [1, 1, …] / 128 also changes the actions of the agent significantly. Edits to the K weights have a very significant effect on the performance of the agent, while the effects appear much smaller for the V weights. The video below shows the effects of editing the first and third singular vectors, first on the K weights and then on the V weights.

```python
import numpy as np

# Overwrite one right singular vector of a weight matrix M and reassemble it.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
Vt[0, :] = 1.0 / 128           # replace the first right singular vector with [1, 1, ...] / 128
M = U @ np.diag(S) @ Vt        # reconstruct the edited weight matrix
```

## SVD for the State Input

In the original publication, Conjecture only looked at the output embedding for tokens, but there seems to be no reason not to also look at the input embeddings. This way it may be possible to trace where information is first processed.

### K Weights in the First Block

### OV-Circuit Weights in the First Block

## Reward Embedding

The strongest signal for the input reward embedding can be found in the V weights of the second transformer block. The cosine similarity is only 0.24, but that is still more than one would expect by chance for length-128 embedding vectors.
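As a sanity check on that chance baseline, the typical cosine similarity between random unit vectors in 128 dimensions can be estimated by simulation; the expected absolute value is roughly sqrt(2 / (pi · 128)) ≈ 0.07, so 0.24 is well above chance:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 10_000

# n pairs of random unit vectors in R^d.
a = rng.standard_normal((n, d))
b = rng.standard_normal((n, d))
a /= np.linalg.norm(a, axis=1, keepdims=True)
b /= np.linalg.norm(b, axis=1, keepdims=True)

# Mean absolute cosine similarity across the pairs.
mean_abs = np.abs((a * b).sum(axis=1)).mean()
print(mean_abs)  # roughly 0.07 for d = 128
```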

## Time Embedding

It was pretty unclear what the plots for the timestep embedding meant, although the V weights of the MLP in the first transformer block did appear to react to the embedding. In general, this task probably does not require knowledge of the time to be completed successfully, which could explain why there doesn't seem to be any structure here.

## SVD on Larger Models like GPT-J-6B

I also tried to apply the SVD method to larger models like GPT-J-6B, both before and after finetuning; there is a notebook in the repo. While Conjecture used SVD on a GPT-2-sized model in the original post, the technique appears to work significantly worse on larger models. I believe one reason is the increase in residual embedding size from 1024 in GPT-2 to 4096 in GPT-J, which makes it significantly harder to achieve a high cosine similarity between two vectors.
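This dimensionality argument is easy to check numerically: the expected absolute cosine similarity between random unit vectors scales like 1/sqrt(d), so it roughly halves when going from 1024 to 4096 dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cos(d, n=5000):
    """Mean |cosine similarity| between n pairs of random unit vectors in R^d."""
    a = rng.standard_normal((n, d))
    b = rng.standard_normal((n, d))
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return np.abs((a * b).sum(axis=1)).mean()

print(mean_abs_cos(1024))  # ~0.025, residual size stated for GPT-2 above
print(mean_abs_cos(4096))  # ~0.012, residual size stated for GPT-J above
```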

## Code

The code for this study can be found on my GitHub. There is also a Colab notebook I created to show the effects of changing singular values.

## Future Work

- Pin down what these values actually mean and how they could be used to locate and edit knowledge or capabilities.
- Editing the weight matrices can be used to further explore where certain capabilities reside in a model: e.g., if you predict that a capability lives in a particular part of the network, edits there should selectively degrade it.
- A discrete action space should make such effects easier to see; I am working on getting such a model.
- Automate aspects that are currently done by hand. For example, the current network has only 3 transformer blocks, and I manually check which weights show the greatest cosine similarity.
- Automate the evaluation of changes in the agent's behaviour. Currently I run the agent in simulation and watch it. Ideally this could be automated over hundreds of runs with a statistical measure of how fast and how often the agent fails.
- The main conclusion of the original Conjecture post was that singular vectors frequently represent semantic clusters. It is not clear how to move this to the space of decision transformers, especially for this particular continuous environment. This might show up more clearly in a discrete environment.