Redwood Research
Applied to Will alignment-faking Claude accept a deal to reveal its misalignment? by Ryan Greenblatt, 3mo ago
Applied to A sketch of an AI control safety case by Tomek Korbak, 3mo ago
Applied to How will we update about scheming? by Ryan Greenblatt, 4mo ago
Applied to Polysemanticity and Capacity in Neural Networks by Dakara, 4mo ago
Applied to A quick experiment on LMs’ inductive biases in performing search by Dakara, 4mo ago
Applied to Balancing Label Quantity and Quality for Scalable Elicitation by Dakara, 4mo ago
Applied to Coup probes: Catching catastrophes with probes trained off-policy by Dakara, 4mo ago
Applied to Measurement tampering detection as a special case of weak-to-strong generalization by Dakara, 4mo ago
Applied to Toy models of AI control for concentrated catastrophe prevention by Dakara, 4mo ago
Applied to Notes on control evaluations for safety cases by Dakara, 4mo ago
Applied to [Paper] Stress-testing capability elicitation with password-locked models by Dakara, 4mo ago
Applied to Catching AIs red-handed by Dakara, 4mo ago
Applied to Improving the Welfare of AIs: A Nearcasted Proposal by Dakara, 4mo ago
Applied to Preventing Language Models from hiding their reasoning by Dakara, 4mo ago
Applied to Preventing model exfiltration with upload limits by Dakara, 4mo ago
Applied to Untrusted smart models and trusted dumb models by Dakara, 4mo ago
Applied to Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation by Dakara, 4mo ago
Applied to The case for ensuring that powerful AIs are controlled by Dakara, 4mo ago
Applied to Access to powerful AI might make computer security radically easier by Dakara, 4mo ago
Applied to A basic systems architecture for AI agents that do autonomous research by Dakara, 4mo ago