Redwood Research
• Applied to Polysemanticity and Capacity in Neural Networks by Dakara 4d ago
• Applied to A quick experiment on LMs’ inductive biases in performing search by Dakara 4d ago
• Applied to Balancing Label Quantity and Quality for Scalable Elicitation by Dakara 4d ago
• Applied to Coup probes: Catching catastrophes with probes trained off-policy by Dakara 4d ago
• Applied to Measurement tampering detection as a special case of weak-to-strong generalization by Dakara 4d ago
• Applied to Toy models of AI control for concentrated catastrophe prevention by Dakara 4d ago
• Applied to Notes on control evaluations for safety cases by Dakara 4d ago
• Applied to [Paper] Stress-testing capability elicitation with password-locked models by Dakara 4d ago
• Applied to Catching AIs red-handed by Dakara 4d ago
• Applied to Improving the Welfare of AIs: A Nearcasted Proposal by Dakara 4d ago
• Applied to Preventing Language Models from hiding their reasoning by Dakara 4d ago
• Applied to Preventing model exfiltration with upload limits by Dakara 4d ago
• Applied to Untrusted smart models and trusted dumb models by Dakara 4d ago
• Applied to Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation by Dakara 4d ago
• Applied to The case for ensuring that powerful AIs are controlled by Dakara 4d ago
• Applied to Access to powerful AI might make computer security radically easier by Dakara 4d ago
• Applied to A basic systems architecture for AI agents that do autonomous research by Dakara 4d ago
• Applied to Win/continue/lose scenarios and execute/replace/audit protocols by Dakara 4d ago
• Applied to Why imperfect adversarial robustness doesn't doom AI control by Dakara 4d ago
• Applied to Apply to the second iteration of the ML for Alignment Bootcamp (MLAB 2) in Berkeley [Aug 15 - Fri Sept 2] by Satron 4d ago