All of Johannes Treutlein's Comments + Replies

I played around with this a little bit now. First, I correlated OOD performance vs. Freeform definition performance, for each model and function. I got a correlation coefficient of ca. 0.16. You can see a scatter plot below. Every dot corresponds to a tuple of a model and a function. Note that transforming the points into logits or similar didn't really help.

Next, I took one of the finetunes and functions where OOD performance wasn't perfect. I chose 1.75x and my first functions finetune (OOD performance at 82%). Below, I plot the function values that th... (read more)

My guess is that for any given finetune and function, OOD regression performance correlates with performance on providing definitions, but that the model doesn't perform better on its own provided definitions than on the ground truth definitions. From looking at plots of function values, the way they are wrong OOD often looked more like noise or calculation errors to me rather than eg getting the coefficient wrong. I'm not sure, though. I might run an evaluation on this soon and will report back here.

2Johannes Treutlein
I played around with this a little bit now. First, I correlated OOD performance vs. Freeform definition performance, for each model and function. I got a correlation coefficient of ca. 0.16. You can see a scatter plot below. Every dot corresponds to a tuple of a model and a function. Note that transforming the points into logits or similar didn't really help.

Next, I took one of the finetunes and functions where OOD performance wasn't perfect. I chose 1.75x and my first functions finetune (OOD performance at 82%). Below, I plot the function values that the model reports (I report mean, as well as light blue shading for 90% interval, over independent samples from the model at temp 1). This looks like a typical plot to me. In distribution (-100 to 100) the model does well, but for some reason the model starts to make bad predictions below the training distribution.

A list of some of the sampled definitions from the model:

Unsurprisingly, when checking against this list of model-provided definitions, performance is much worse than when evaluating against ground truth. It would be interesting to look into more different functions and models, as there might exist ones with a stronger connection between OOD predictions and provided definitions. However, I'll leave it here for now.
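A minimal sketch of this kind of per-(model, function) correlation/scatter check, with placeholder variable names and fake data (the real analysis would load the measured accuracies instead):

```python
# Sketch: correlate OOD regression performance with freeform-definition
# performance, one point per (model, function) pair. Placeholder data only.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
def_acc = rng.uniform(0.3, 1.0, size=120)                  # placeholder: definition accuracy
ood_acc = 0.2 * def_acc + rng.uniform(0.0, 0.8, size=120)  # placeholder: OOD accuracy

r = np.corrcoef(def_acc, ood_acc)[0, 1]  # Pearson correlation (ca. 0.16 in the comment above)
plt.scatter(def_acc, ood_acc, alpha=0.6)
plt.xlabel("Freeform definition performance")
plt.ylabel("OOD regression performance")
plt.title(f"r = {r:.2f}")
plt.show()
```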

Yes, one could e.g. have a clear disclaimer above the chat window saying that this is a simulation and not the real Bill Gates. I still think this is a bit tricky. E.g., Bill Gates could be really persuasive and insist that the disclaimer is wrong. Some users might then end up believing Bill Gates rather than the disclaimer. Moreover, even if the user believes the disclaimer on a conscious level, impersonating someone might still have a subconscious effect. E.g., imagine an AI friend or companion who repeatedly reminds you that they are just an AI, versus ... (read more)

My takeaway from looking at the paper is that the main work is being done by the assumption that you can split up the joint distribution implied by the model as a mixture distribution, such that the model does Bayesian inference in this mixture model to compute the next sentence given a prompt. Together with the assumption that one of the mixture components is always bad (the sup condition you talk about), this makes the whole approach of giving more and more evidence for that bad component by stringing together bad se... (read more)
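A generic way to write down the kind of decomposition being described (the notation is illustrative and may not match the paper's): the model's distribution over texts is a mixture

\[
P(x) = \sum_i \alpha_i P_i(x), \qquad
P(s_{n+1} \mid s_{1:n}) = \sum_i P(i \mid s_{1:n})\, P_i(s_{n+1} \mid s_{1:n}), \qquad
P(i \mid s_{1:n}) \propto \alpha_i\, P_i(s_{1:n}),
\]

and if one component assigns every "bad" sentence at least some fixed factor more probability than the other components do, then each additional bad sentence multiplies the posterior odds in favor of that component, so its posterior weight tends to 1 as bad sentences are strung together.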

Some further thoughts on training ML models, based on discussions with Caspar Oesterheld:

  • I don't see a principled reason why one couldn't use one and the same model for both agents. I.e., do standard self-play training with weight sharing for this zero-sum game. Since both players have exactly the same loss function, we don't need to allow them to specialize by feeding in a player id or something like that (there exists a symmetric Nash equilibrium).
  • There is one problem with optimizing the objective in the zero-sum game via gradient descent (assuming we co... (read more)

Regarding your last point 3., why does this make you more pessimistic rather than just very uncertain about everything?

3Lauro Langosco
It does make me more uncertain about most of the details. And that then makes me more pessimistic about the solution, because I expect that I'm missing some of the problems. (Analogy: say I'm working on a math exercise sheet and I have some concrete reason to suspect my answer may be wrong; if I then realize I'm actually confused about the entire setup, I should be even more pessimistic about having gotten the correct answer).

Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it's unclear whether that pointer is simpler than a very simple misaligned goal.

Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representati... (read more)

3Richard Ngo
So I'm imagining the agent doing reasoning like:

Misaligned goal --> I should get high reward --> Behavior aligned with reward function

and then I'm hypothesizing that whatever the first misaligned goal is, it requires some amount of complexity to implement, and you could just get rid of it and make "I should get high reward" the terminal goal. (I could imagine this being false though depending on the details of how terminal and instrumental goals are implemented.)

I could also imagine something more like:

Misaligned goal --> I should behave in aligned ways --> Aligned behavior

and then the simplicity bias pushes towards alignment. But if there are outer alignment failures then this incurs some additional complexity compared with the first option.

Or a third, perhaps more realistic option is that the misaligned goal leads to two separate drives in the agent: "I should get high reward" and "I should behave in aligned ways", and that the question of which ends up dominating when they clash will be determined by how the agent systematizes multiple goals into a single coherent strategy (I'll have a post on that topic up soon).

Thanks for your comment!

Regarding 1: I don't think it would be good to simulate superintelligences with our predictive models. Rather, we want to simulate humans to elicit safe capabilities. We talk more about competitiveness of the approach in Section III.

Regarding 3: I agree it might have been good to discuss cyborgism specifically. I think cyborgism is to some degree compatible with careful conditioning. One possible issue when interacting with the model arises when the model is trained on / prompted with its own outputs, or data that has been influence... (read more)

You are right, thanks for the comment! Fixed it now.

I like the idea behind this experiment, but I find it hard to tell from this write-up what is actually going on. I.e., what is exactly the training setup, what is exactly the model, which parts are hard-coded and which parts are learned? Why is it a weirdo janky thing instead of some other standard model or algorithm? It would be good if this was explained more in the post (it is very effortful to try to piece this together by going through the code). Right now I have a hard time making any inferences from the results.

Update: we recently discovered the performative prediction (Perdomo et al., 2020) literature (HT Alex Pan). This is a machine learning setting where we choose a model parameter (e.g., parameters for a neural network) that minimizes expected loss (e.g., classification error). In performative prediction, the distribution over data points can depend on the choice of model parameter. Our setting is thus a special case in which the parameter of interest is a probability distribution, the loss is a scoring function, and data points are discrete outcomes. Most re... (read more)
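Schematically, the correspondence looks like this (notation here is illustrative): performative prediction chooses a parameter to minimize the performative risk, and our setting instantiates it with a reported distribution and a scoring rule,

\[
\min_{\theta}\; \mathbb{E}_{z \sim \mathcal{D}(\theta)}\big[\ell(z;\theta)\big]
\quad\longrightarrow\quad
\max_{p}\; \mathbb{E}_{o \sim f(p)}\big[S(p, o)\big],
\]

where $p$ is the predicted distribution over discrete outcomes $o$, $S$ is the scoring function, and $f$ maps each prediction to the outcome distribution it induces.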

I think there should be a space both for in-progress research dumps and for more worked out final research reports on the forum. Maybe it would make sense to have separate categories for them or so.

I'm not sure I understand what you mean by a skill-free scoring rule. Can you elaborate what you have in mind?

2Matthew "Vaniver" Gray
Sure, points from a scoring rule come both from 'skill' (whether or not you're accurate in your estimates) and 'calibration' (whether your estimates line up with the underlying propensity).

Rather than generating the picture I'm thinking of (sorry, up to something else and so just writing a quick comment), I'll describe it: watch this animation, and see the implied maximum expected score as a function of p (the forecaster's true belief). For all of the scoring rules, it's a convex function with maxima at 0 and 1. (You can get 1 point on average with a linear rule if p=0, and only 0.5 points on average if p=0.5; for a log rule, it's 0 points and -0.7 points.)

But could you come up with a scoring rule where the maximum expected score as a function of p is flat? If true, there's no longer an incentive to have extreme probabilities. But that incentive was doing useful work before, and so this seems likely to break something else--it's probably no longer the case that you're incentivized to say your true belief--or require something like batch statistics (since I think you might be able to get something like this by scoring not individual predictions but sets of them, sorted by p or by whether they were true or false). [This can be done in some contexts with markets, where your reward depends on how close the market was to the truth before, but I think it probably doesn't help here because we're worried about the oracle's ability to affect the underlying reality, which is also an issue with prediction markets!]

To be clear, I'm not at all confident this is possible or sensible--it seems likely to me that an adversarial argument goes thru where as oracle I always benefit from knowing which statements are true and which statements are false (even if I then lie about my beliefs to get a good calibration curve or w/e)--but that's not an argument about the scale of the distortions that are possible.
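To make the numbers in that parenthetical concrete (assuming the "linear rule" pays q on a hit and 1−q on a miss, and natural log for the log rule), the maximum expected score as a function of the true belief p is

\[
G_{\mathrm{lin}}(p) = \max_q \big[\, p q + (1-p)(1-q) \,\big] = \max(p,\, 1-p), \qquad
G_{\mathrm{log}}(p) = \max_q \big[\, p \ln q + (1-p)\ln(1-q) \,\big] = p\ln p + (1-p)\ln(1-p),
\]

so $G_{\mathrm{lin}}(0) = 1$, $G_{\mathrm{lin}}(0.5) = 0.5$, $G_{\mathrm{log}}(0) = 0$, and $G_{\mathrm{log}}(0.5) = \ln(0.5) \approx -0.7$; both curves are convex in p with their maxima at the endpoints.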

Thanks for your comment!

Your interpretation sounds right to me. I would add that our result implies that it is impossible to incentivize honest reports in our setting. If you want to incentivize honest reports when the outcome distribution does not depend on the prediction, then you have to use a strictly proper scoring rule (this is just the definition of “strictly proper”). But we show that for any strictly proper scoring rule, there is a function mapping predictions to outcome distributions such that a dishonest prediction is optimal.

Proposition 13 shows that it is possible to “tune” scoring rules to make optimal predictions very close to h... (read more)
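As a toy illustration of the impossibility claim (a made-up example, not one from the paper): take the quadratic (Brier) scoring rule and let the probability of the outcome depend linearly on the reported probability. The honest report is then the fixed point of that dependence, but the expected score is maximized at a different report.

```python
# Toy example (not from the paper): Brier score with a prediction-dependent
# outcome probability f(q) = 0.8 - 0.4*q. The honest (self-fulfilling) report
# is the fixed point q* = f(q*), but a slightly different report scores better.
import numpy as np

def f(q):
    # assumed dependence of the outcome probability on the reported q
    return 0.8 - 0.4 * q

def expected_score(q):
    s1 = 1 - (1 - q) ** 2   # Brier score if the outcome happens
    s0 = 1 - q ** 2         # Brier score if it doesn't
    return f(q) * s1 + (1 - f(q)) * s0

qs = np.linspace(0, 1, 100001)
q_honest = 0.8 / 1.4                         # fixed point of f, ~0.571
q_best = qs[np.argmax(expected_score(qs))]   # ~0.556

print(q_honest, expected_score(q_honest))    # ~0.571, ~0.755
print(q_best, expected_score(q_best))        # ~0.556, ~0.756
```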

2Matthew "Vaniver" Gray
Agreed for proper scoring rules, but I'd be a little surprised if it's not possible to make a skill-free scoring rule, and then get an honest prediction result for that. [This runs into other issues--if the scoring rule is skill-free, where does the skill come from?--but I think this can be solved by having oracle-mode and observation-mode, and being able to do honest oracle-mode at all would be nice.]

I think such a natural progression could also lead to something similar to extinction (in addition to permanently curtailing humanity's potential). E.g., maybe we are currently in a regime where optimizing proxies harder still leads to improvements to the true objective, but this could change once we optimize those proxies even more. The natural progression could follow an inverted U-shape.

E.g., take the marketing example. Maybe we will get superhuman persuasion AIs, but also AIs that protect us from persuasive ads and AIs that can provide honest reviews. ... (read more)

There is a chance that one can avoid having to solve ontology identification in general if one punts the problem to simulated humans. I.e., it seems one can train the human simulator without solving it, and then use simulated humans to solve the problem. One may have to solve some specific ontology identification problems to make sure one gets an actual human simulator and not e.g. a malign AI simulator. However, this might be easier than solving the problem in full generality.

Minor comment: regarding the RLHF example, one could solve the problem implicitl... (read more)

Great post!

I like that you point out that we'd normally do trial and error, but that this might not work with AI. I think you could possibly make clearer where this fails in your story. You do point out how HLMI might become extremely widespread and how it might replace most human work. Right now it seems to me like you argue essentially that the problem is a large-scale accident that comes from a distribution shift. But this doesn't yet say why we couldn't e.g. just continue trial-and-error and correct the AI once we notice that something is going wrong. ... (read more)

2Leon Lang
Yes, after reflection I think this is correct. I think I had in mind a situation where with deployment, the training of the AI system simply stops, but of course, this need not be the case. So if training continues, then one either needs to argue stronger reasons why the distribution shift leads to a catastrophe (e.g., along the lines you argue) or make the case that the training signal couldn't keep up with the fast pace of the development. The latter would be an outer alignment failure, which I tried to avoid talking about in the text. 

Overall I agree that solutions to deception look different from solutions to other kinds of distributional shift. (Also, there are probably different solutions to different kinds of large distributional shift as well. E.g., solutions to capability generalization vs solutions to goal generalization.)

I do think one could claim that some general solutions to distributional shift would also solve deceptiveness. E.g., the consensus algorithm works for any kind of distributional shift, but it should presumably also avoid deceptiveness (in the sense that it would... (read more)

I like this post and agree that there are different threat models one might categorize broadly under "inner alignment". Before reading this I hadn't reflected on the relationship between them.

Some random thoughts (after an in-person discussion with Erik):

  • For distributional shift and deception, there is a question of what is treated as fixed and what is varied when asking whether a certain agent has a certain property. E.g., I could keep the agent constant but put it into a new environment, and ask whether it is still aligned. Or I could keep the environmen... (read more)
2Erik Jenner
Thanks for the comments! I technically agree with what you're saying here, but one of the implicit claims I'm trying to make in this post is that this is not a good way to think about deception. Specifically, I expect solutions to deception to look quite different from solutions to (large) distributional shift. Curious if you disagree with that.

Great post!

Regarding your “Redirecting civilization” approach: I wonder about the competitiveness of this. It seems that we will likely build x-risk-causing AI before we have a good enough model to be able to e.g. simulate the world 1000 years into the future on an alternative timeline? Of course, competitiveness is an issue in general, but the more factored cognition or IDA based approaches seem more realistic to me.

Alternatively, we can try to be clever and “import” research from the future repeatedly. For instance we can first ask our model to produce r... (read more)
2Adam Jermyn
Thanks! I'm not sure. My sense is that generative models have a huge lead in terms of general capabilities over ~everything else, and that seems to be where the most effort is going today. So unless something changes there I expect generative models to be the state of the art when we hit x-risk territory.

That said, it's totally possible that the x-risk-causing generative model happens before the model that can simulate thousands of years of history. I'm not confident in this either way.

One thing that gives me hope in favor of simulating long histories is that to some extent it's "just" a matter of more compute, and if we get promising results simulating short spans of history it might not be hard to justify a lot of spending on simulating longer stretches. And there's a bright spot there too: simulating longer times likely scales sub-linearly with amount of history simulated. If you have a dynamics model then simulating for twice as long costs double the compute. If you've got a more clever model that knows how to take shortcuts/compress the dynamics you can probably do better.

I'm pretty concerned about this. I said a bit about this in the "No Fixed Points" section, but basically I think you have to do something to avoid fixed points, otherwise you get all sorts of world-ending optimization pressures. If you do that, you're not allowed any recursion where the model simulates itself, and then you get stuck with the problem of how to introduce future research into the past without making a malicious AGI the most likely explanation...

These issues of preferences over objects of different types (internal states, policies, actions, etc.) and how to translate between them are also discussed in the post Agents Over Cartesian World Models.

Your post seems to be focused more on pointing out a missing piece in the literature rather than asking for a solution to the specific problem (which I believe is a valuable contribution). Regardless, here is roughly how I would understand “what they mean”:

Consider the task space, the output space, and the model space, together with our base objective and the model's mesa objective for each input. Assume that there exists some map mapping internal objects to outputs by the model,... (read more)

Thank you!

It does seem like simulating text generated by using similar models would be hard to avoid when using the model as a research assistant. Presumably any research would get “contaminated” at some point, and models might cease to be helpful without updating them on the newest research.

In theory, if one were to re-train models from scratch on the new research, this might be equivalent to the models updating on the previous models' outputs before reasoning about superrationality, so it would turn things into a version of Newcomb's problem with transpa... (read more)

2Adam Jermyn
Oh interesting. I think this still runs into the issue that you'll have instrumental goals whenever you ask the model to simulate itself (i.e. just the first step in the hierarchy hits this issue). I was imagining that we train the model to predict e.g. tomorrow's newspaper given today's. The fact that it's not just a stream of text but comes with time-stamps (e.g. this was written X hours later) feels important for making it simulate actual histories.

Thanks for your comment! I agree that we probably won't be able to get a textbook from the future just by prompting a language model trained on human-generated texts.

As mentioned in the post, maybe one could train a model to also condition on observations. If the model is very powerful, and it really believes the observations, one could make it work. I do think sometimes it would be beneficial for a model to attain superhuman reasoning skills, even if it is only modeling human-written text. Though of course, this might still not happen in practice.

Overall ... (read more)

Would you count issues with malign priors etc. also as issues with myopia? Maybe I'm missing something about what myopia is supposed to mean and be useful for, but these issues seem to have a similar spirit of making an agent do stuff that is motivated by concerns about things happening at different times, in different locations, etc.

E.g., a bad agent could simulate 1000 copies of the LCDT agent and reward it for a particular action favored by the bad agent. Then depending on the anthropic beliefs of the LCDT agent, it might behave so as to maximize this r... (read more)

"If someone had a strategy that took two years, they would have to over-bid in the first year, taking a loss. But then they have to under-bid on the second year if they're going to make a profit, and--"

"And they get undercut, because someone figures them out."

I think one could imagine scenarios where the first trader can use their influence in the first year to make sure they are not undercut in the second year, analogous to the prediction market example. For instance, the trader could install some kind of encryption in the software that this company use... (read more)

I find this particularly curious since naively, one would assume that weight sharing implicitly implements a simplicity prior, so it should make optimization more likely and thus also deceptive behavior? Maybe the argument is that somehow weight sharing leaves less wiggle room for obscuring one's reasoning process, making a potential optimizer more interpretable? But the hidden states and tied weights could still be encoding deceptive reasoning in an uninterpretable way?

Wolfgang Spohn develops the concept of a "dependency equilibrium" based on a similar notion of evidential best response (Spohn 2007, 2010). A joint probability distribution is a dependency equilibrium if all actions of all players that have positive probability are evidential best responses. In case there are actions with zero probability, one evaluates a sequence of joint probability distributions $(p_n)_{n\in\mathbb{N}}$ such that $p_n \to p$ and $p_n(a) > 0$ for all actions $a$ and all $n$. Using your notation of a probability matrix and a utility matrix, the expected utili... (read more)
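Spelling out the best-response condition in this notation (a rough paraphrase of Spohn's definition): an action $a_i$ of player $i$ is an evidential best response if, along such a sequence,

\[
\lim_{n\to\infty} \mathbb{E}_{p_n}\!\left[u_i \mid a_i\right] \;\ge\; \lim_{n\to\infty} \mathbb{E}_{p_n}\!\left[u_i \mid a_i'\right] \quad \text{for all alternative actions } a_i',
\]

and $p$ is a dependency equilibrium if there is such a sequence for which every action with positive probability under $p$ is an evidential best response.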

Thanks for your answer! This "gain" approach seems quite similar to what Wedgwood (2013) has proposed as "Benchmark Theory", which behaves like CDT in cases with, but more like EDT in cases without causally dominant actions. My hunch would be that one might be able to construct a series of thought-experiments in which such a theory violates transitivity of preference, as demonstrated by Ahmed (2012).

I don't understand how you arrive at a gain of 0 for not smoking as a smoke-lover in my example. I would think the gain for not smoking is higher:

... (read more)
0Abram Demski
Ah, you're right. So gain doesn't achieve as much as I thought it did. Thanks for the references, though. I think the idea is also similar in spirit to a proposal of Jeffrey's in his book The Logic of Decision; he presents an evidential theory, but is as troubled by cooperating in prisoner's dilemma and one-boxing in Newcomb's problem as other decision theorists. So, he suggests that a rational agent should prefer actions such that, having updated on probably taking that action rather than another, you still prefer that action. (I don't remember what he proposed for cases when no such action is available.) This has a similar structure of first updating on a potential action and then checking how alternatives look from that position.

From my perspective, I don’t think it’s been adequately established that we should prefer updateless CDT to updateless EDT

I agree with this.

It would be nice to have an example which doesn’t arise from an obviously bad agent design, but I don’t have one.

I’d also be interested in finding such a problem.

I am not sure whether your smoking lesion steelman actually makes a decisive case against evidential decision theory. If an agent knows about their utility function on some level, but not on the epistemic level, then this can just as well be made into a... (read more)
1Diffractor
I think that in that case, the agent shouldn't smoke, and CDT is right, although there is side-channel information that can be used to come to the conclusion that the agent should smoke. Here's a reframing of the provided payoff matrix that makes this argument clearer. (Also, your problem as stated should have 0 utility for a nonsmoker imagining the situation where they smoke and get killed.)

Let's say that there is a kingdom which contains two types of people, good people and evil people, and a person doesn't necessarily know which type they are. There is a magical sword enchanted with a heavenly aura, and if a good person wields the sword, it will guide them to do heroic things, for +10 utility (according to a good person) and 0 utility (according to a bad person). However, if an evil person wields the sword, it will afflict them for the rest of their life with extreme itchiness, for -100 utility (according to everyone).

good person's utility estimates:
  • takes sword
    • I'm good: 10
    • I'm evil: -90
  • don't take sword: 0

evil person's utility estimates:
  • takes sword
    • I'm good: 0
    • I'm evil: -100
  • don't take sword: 0

As you can clearly see, this is the exact same payoff matrix as the previous example. However, now it's clear that if a (secretly good) CDT agent believes that most of society is evil, then it's a bad idea to pick up the sword, because the agent is probably evil (according to the info they have) and will be tormented with itchiness for the rest of their life, and if it believes that most of society is good, then it's a good idea to pick up the sword. Further, this situation is intuitively clear enough to argue that CDT just straight-up gets the right answer in this case.

A human (with some degree of introspective power) in this case, could correctly reason "oh hey I just got a little warm fuzzy feeling upon thinking of the hypothetical where I wield the sword and it doesn't curse me. This is evidence that I'm good,
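As a quick check of the arithmetic behind this (assuming the agent uses the good person's utility estimates and has credence g that it is good): the causal expected utilities are

\[
\mathbb{E}[U \mid \text{take sword}] = 10g - 90(1-g) = 100g - 90,
\qquad
\mathbb{E}[U \mid \text{don't take}] = 0,
\]

so under these numbers a (secretly good) CDT agent picks up the sword only if it is more than 90% confident that it is good.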
0Abram Demski
Excellent example. It seems to me, intuitively, that we should be able to get both the CDT feature of not thinking we can control our utility function through our actions and the EDT feature of taking the information into account.

Here's a somewhat contrived decision theory which I think captures both effects. It only makes sense for binary decisions. First, for each action you compute the posterior probability of the causal parents for each decision. So, depending on details of the setup, smoking tells you that you're likely to be a smoke-lover, and refusing to smoke tells you that you're more likely to be a non-smoke-lover. Then, for each action, you take the action with best "gain": the amount better you do in comparison to the other action keeping the parent probabilities the same:

Gain(a) = E(U|a) − E(U|a, do(¯a))

(E(U|a, do(¯a)) stands for the expectation on utility which you get by first Bayes-conditioning on a, then causal-conditioning on its opposite.)

The idea is that you only want to compare each action to the relevant alternative. If you were to smoke, it means that you're probably a smoker; you will likely be killed, but the relevant alternative is one where you're also killed. In my scenario, the gain of smoking is +10. On the other hand, if you decide not to smoke, you're probably not a smoker. That means the relevant alternative is smoking without being killed. In my scenario, the smoke-lover computes the gain of this action as -10. Therefore, the smoke-lover smokes.

(This only really shows the consistency of an equilibrium where the smoke-lover smokes -- my argument contains the unjustified assumption that smoking is good evidence for being a smoke lover and refusing to smoke is good evidence for not being one, which is only justified in a circular way by the conclusion.)

In your scenario, the smoke-lover computes the gain of smoking at +10, and the gain of not smoking at 0. So, again, the smoke-lover smokes.

The solution seems too ad-hoc to really

Thanks for the link! What I don't understand is how this works in the context of empirical and logical uncertainty. Also, it's unclear to me how this approach relates to Bayesian conditioning. E.g. if the sentence "if a holds, then o holds" is true, doesn't this also mean that P(o|a)=1? In that sense, proof-based UDT would just be an elaborate specification of how to assign these conditional probabilities "from the viewpoint of the original position", so with updatelessness, and in the context of full logical inference and knowledge of ... (read more)

1cousin_it
To me, proof-based UDT is a simple framework that includes probabilistic/Bayesian reasoning as a special case. For example, if the world is deterministic except for a single coinflip, you specify a preference ordering on pairs of outcomes of two deterministic worlds. Fairness or non-fairness of the coinflip will be encoded into the ordering, so the decision can be based on completely deterministic reasoning. All probabilistic situations can be recast in this way. That's what UDT folks mean by "probability as caring". It's really cool that UDT lets you encode any setup with probability, prediction, precommitment etc. into a few (complicated and self-referential) sentences in PA or GL that are guaranteed to have truth values. And since GL is decidable, you can even write a program that will solve all such problems for you.

Thanks a lot for your elaborate reply!

(So I'm not even sure what CDT is supposed to do here, since it's not clear that the bet is really on the past state of the world and not on truth of a proposition about the future state of the world.)

Hmm, good point. The truth of the proposition is evaluated on the basis of Alice's action, which she can causally influence. But we could think of a Newcomblike scenario in which someone made a perfect prediction 100 years ago and put down a note about what state the world was in at that time. Now instead of checking Al... (read more)

That's what I was trying to do with the Coin Flip Creation :) My guess: once you specify the Smoking Lesion and make it unambiguous, it ceases to be an argument against EDT.

1Tobias_Baumann
What exactly do you think we need to specify in the Smoking Lesion?

I suspect this is a confusion about free will. To be concrete, I think that a thermostat has a causal influence on the future, and does not violate determinism. It deterministically observes a sensor, and either turns on a heater or a cooler based on that sensor, in a way that does not flow backwards--turning on the heater manually will not affect the thermostat's attempted actions except indirectly through the eventual effect on the sensor.

Fair point :) What I meant was that for every world history, there is only one causal influence I could possibly h... (read more)

I agree with points 1) and 2). Regarding point 3), that's interesting! Do you think one could also prove that if you don't smoke, you can't (or are less likely to) have the gene in the Smoking Lesion? (See also my response to Vladimir Nesov's comment.)

1cousin_it
I can only give a clear-cut answer if you reformulate the smoking lesion problem in terms of Omega and specify the UDT agent's egoism or altruism :-)

The point of decision theories is not that they let you reach from beyond the Matrix and change reality in violation of physics; it's that you predictably act in ways that optimize for various criteria.

I agree with this. But I would argue that causal counterfactuals somehow assume that we can "reach from beyond the Matrix and change reality in violation of physics". They work by comparing what would happen if we detached our “action node” from its ancestor nodes and manipulated it in different ways. So causal thinking in some way seems to viol... (read more)
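One standard way to write the contrast being pointed at (generic notation, with pa ranging over the causal parents of the action node):

\[
\mathbb{E}_{\mathrm{EDT}}[U \mid a] = \sum_{\mathrm{pa}} P(\mathrm{pa} \mid a)\,\mathbb{E}[U \mid a, \mathrm{pa}],
\qquad
\mathbb{E}_{\mathrm{CDT}}[U \mid \mathrm{do}(a)] = \sum_{\mathrm{pa}} P(\mathrm{pa})\,\mathbb{E}[U \mid a, \mathrm{pa}],
\]

i.e., CDT averages over the action's causal parents with their unconditional probabilities, which is the sense in which the action node is treated as detached from its ancestors.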

1Matthew "Vaniver" Gray
I agree there's a point here that lots of decision theories / models of agents / etc. are dualistic instead of naturalistic, but I think that's orthogonal to EDT vs. CDT vs. LDT; all of them assume that you could decide to take any of the actions that are available to you.

I suspect this is a confusion about free will. To be concrete, I think that a thermostat has a causal influence on the future, and does not violate determinism. It deterministically observes a sensor, and either turns on a heater or a cooler based on that sensor, in a way that does not flow backwards--turning on the heater manually will not affect the thermostat's attempted actions except indirectly through the eventual effect on the sensor.

This depends on the formulation of Newcomb's problem. If it says "Omega predicts you with 99% accuracy" or "Omega always predicts you correctly" (because, say, Omega is Laplace's Demon), then Omega knew that you would learn about decision theory in the way that you did, and there's still a logical dependence between the you looking at the boxes in reality and the you looking at the boxes in Omega's imagination. (This assumes that the 99% fact is known of you in particular, rather than 99% accuracy being something true of humans in general; this gets rid of the case that 99% of the time people's decision theories don't change, but 1% of the time they do, and you might be in that camp.)

If instead the formulation is "Omega observed the you of 10 years ago, and was able to determine whether or not you then would have one-boxed or two-boxed on traditional Newcomb's with perfect accuracy. The boxes just showed up now, and you have to decide whether to take one or both," then the logical dependence is shattered, and two-boxing becomes the correct move.

If instead the formulation is "Omega observed the you of 10 years ago, and was able to determine whether or not you then would have one-boxed or two-boxed on this version of Newcomb's with perfect accuracy. The bo

Thanks for your comment! I find your line of reasoning in the ASP problem and the Coin Flip Creation plausible. So your point is that, in both cases, by choosing a decision algorithm, one also gets to choose where this algorithm is being instantiated? I would say that in the CFC, choosing the right action is sufficient, while in the ASP you also have to choose the whole UDP program so as to be instantiated in a beneficial way (similar to the distinction of how TDT iterates over acts and UDT iterates over policies).

Would you agree that the Coin Flip Creatio... (read more)

1Vladimir Nesov
To clarify, it's the algorithm itself that chooses how it behaves. So I'm not talking about how algorithm's instantiation depends on the way programmer chooses to write it, instead I'm talking about how algorithm's instantiation depends on the choices that the algorithm itself makes, where we are talking about a particular algorithm that's already written. Less mysteriously, the idea of algorithm's decisions influencing things describes a step in the algorithm, it's how the algorithm operates, by figuring out something we could call "how algorithm's decisions influence outcomes". The algorithm then takes that thing and does further computations that depend on it.