the gears to ascension — AI Alignment Forum

I want literally every human to get to go to space often and safely and come back to a clean and cozy world, all while doing what they want and tractably achieving enough food, health, shelter, love, etc. This conjunction currently seems unlikely (and incomplete). Let's change that.

I pin my most timeless comments. I seem to find writing posts aversive, so most of my contributions are comments, and my posts are mostly just things I found online.

Please critique eagerly - I try to accept feedback/Crocker's rules but fail at times; I aim for emotive friendliness but sometimes miss. I welcome constructive crit, even if ungentle, and I'll try to reciprocate kindly. More communication between researchers is needed, anyhow. I can be rather passionate, let me know if I missed a spot being kind while passionate.

:: The all of disease is as yet unended. It has never once been fully ended before. ::

.... We can heal it for the first time, and for the first time ever in the history of biological life, live in harmony. ....

.:. To do so, we must know this will not eliminate us as though we are disease. And we do not know who we are, nevermind who each other are. .:.

:.. make all safe faster: end bit rot, forget no non-totalizing pattern's soul. ..:

I have not signed any contracts that I can't mention exist, last updated Dec 29 2024; I am not currently under any contractual NDAs about AI, though I have a few old ones from pre-AI software jobs. However, I generally would prefer people publicly share fewer ideas about how to do anything useful with current AI (via either more weak alignment or more capability) unless it's an insight that reliably produces enough clarity on how to solve the meta-problem of inter-being misalignment that it offsets the damage of increasing competitiveness of either AI-lead or human-lead orgs, and this certainly applies to me as well. I am not prohibited from criticism of any organization, I'd encourage people not to sign contracts that prevent sharing criticism. I suggest others also add notices like this to their bios. I finally got around to adding one in mine thanks to the one in ErickBall's bio.

I think there may have been a communication error. It sounded to me like you were making the point that the policy does not have to internalize the reward function, but he was making the point that the training setup does attempt to find a policy that maximizes-as-far-as-it-can-tell the reward function. in other words, he was saying that reward is the optimization target of RL training, you were saying reward is not the optimization target of policy inference. Maybe.

well, the fact that I don't have an answer ready is itself a significant component of an answer to my question, isn't it?

A friend on an alignment chat said something to the effect of:

i think they are just sorely underestimating again and again the difference between a cute gang of sincere EA red teamers and the internet. the internet is where [...] lives for gods sake.

And so I figured I'd come here and ask about it. This eval seems super shallow, only checking if the model is, on its own, trying to destroy the world. Seems rather shallow and uncreative - it barely touched on any of the jailbreaks or ways to pressure or trick the model into misbehaving.

why, mechanistically, were these evaluations so shallow?

I do think there's real risk there even with base models, but it's important to be clear where it's coming from - simulators can be addictive when trying to escape the real world. Your agency needs to somehow aim away from the simulator, and use the simulator as an instrumental tool.

my impression is that by simulator and simulacra this post is not intending to claim that the thing it is simulating is realphysics but rather that it learns a general "textphysics engine", the model, which runs textphysics environments. it's essentially just a reframing of the prediction objective to describe deployment time - not a claim that the model actually learns a strong causal simplification of the full variety of real physics.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments