Redwood Research
• Applied to Polysemanticity and Capacity in Neural Networks by Dakara 4d ago
• Applied to A quick experiment on LMs’ inductive biases in performing search by Dakara 4d ago
• Applied to Balancing Label Quantity and Quality for Scalable Elicitation by Dakara 4d ago
• Applied to Coup probes: Catching catastrophes with probes trained off-policy by Dakara 4d ago
• Applied to Measurement tampering detection as a special case of weak-to-strong generalization by Dakara 4d ago
• Applied to Toy models of AI control for concentrated catastrophe prevention by Dakara 4d ago
• Applied to Notes on control evaluations for safety cases by Dakara 4d ago
• Applied to [Paper] Stress-testing capability elicitation with password-locked models by Dakara 4d ago
• Applied to Catching AIs red-handed by Dakara 4d ago
• Applied to Improving the Welfare of AIs: A Nearcasted Proposal by Dakara 4d ago
• Applied to Preventing Language Models from hiding their reasoning by Dakara 4d ago
• Applied to Preventing model exfiltration with upload limits by Dakara 4d ago
• Applied to Untrusted smart models and trusted dumb models by Dakara 4d ago
• Applied to Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation by Dakara 4d ago
• Applied to The case for ensuring that powerful AIs are controlled by Dakara 4d ago
• Applied to Access to powerful AI might make computer security radically easier by Dakara 4d ago
• Applied to A basic systems architecture for AI agents that do autonomous research by Dakara 4d ago
• Applied to Win/continue/lose scenarios and execute/replace/audit protocols by Dakara 4d ago
• Applied to Why imperfect adversarial robustness doesn't doom AI control by Dakara 4d ago
• Applied to Apply to the second iteration of the ML for Alignment Bootcamp (MLAB 2) in Berkeley [Aug 15 - Fri Sept 2] by Satron 4d ago