Redwood Research
Applied to Will alignment-faking Claude accept a deal to reveal its misalignment? by Ryan Greenblatt, 3mo ago
Applied to A sketch of an AI control safety case by Tomek Korbak, 3mo ago
Applied to How will we update about scheming? by Ryan Greenblatt, 4mo ago
Applied to Polysemanticity and Capacity in Neural Networks by Dakara, 4mo ago
Applied to A quick experiment on LMs’ inductive biases in performing search by Dakara, 4mo ago
Applied to Balancing Label Quantity and Quality for Scalable Elicitation by Dakara, 4mo ago
Applied to Coup probes: Catching catastrophes with probes trained off-policy by Dakara, 4mo ago
Applied to Measurement tampering detection as a special case of weak-to-strong generalization by Dakara, 4mo ago
Applied to Toy models of AI control for concentrated catastrophe prevention by Dakara, 4mo ago
Applied to Notes on control evaluations for safety cases by Dakara, 4mo ago
Applied to [Paper] Stress-testing capability elicitation with password-locked models by Dakara, 4mo ago
Applied to Catching AIs red-handed by Dakara, 4mo ago
Applied to Improving the Welfare of AIs: A Nearcasted Proposal by Dakara, 4mo ago
Applied to Preventing Language Models from hiding their reasoning by Dakara, 4mo ago
Applied to Preventing model exfiltration with upload limits by Dakara, 4mo ago
Applied to Untrusted smart models and trusted dumb models by Dakara, 4mo ago
Applied to Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation by Dakara, 4mo ago
Applied to The case for ensuring that powerful AIs are controlled by Dakara, 4mo ago
Applied to Access to powerful AI might make computer security radically easier by Dakara, 4mo ago
Applied to A basic systems architecture for AI agents that do autonomous research by Dakara, 4mo ago