Outer Alignment (also known as the reward misspecification problem) is the problem of specifying a reward function that captures human preferences. Outer alignment asks the question: "What should we aim our model at?" In other words, is the model optimizing for the correct reward, such that there are no exploitable loopholes?
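A minimal toy sketch of what an exploitable loophole can look like, using an entirely hypothetical "cleaning robot" setup (the function names and states below are illustrative, not from any particular paper): the designer wants the dirt removed, but the reward that was actually written down only penalizes dirt the agent can see, so covering the camera scores just as well as cleaning.

```python
# Toy illustration of reward misspecification (outer misalignment).
# All names and states here are hypothetical.

def true_objective(state):
    """What the designer actually wants: all dirt removed."""
    return -sum(state["dirt"])

def specified_reward(state):
    """The proxy that was written down: only dirt the agent observes counts."""
    return -sum(d for d, seen in zip(state["dirt"], state["visible"]) if seen)

# Two candidate behaviours the agent could learn.
clean_up     = {"dirt": [0, 0, 0], "visible": [True, True, True]}
cover_camera = {"dirt": [1, 1, 1], "visible": [False, False, False]}

for name, state in [("clean up", clean_up), ("cover camera", cover_camera)]:
    print(f"{name}: specified reward = {specified_reward(state)}, "
          f"true objective = {true_objective(state)}")

# Both behaviours receive the maximum specified reward (0), but only one
# satisfies the true objective -- the proxy leaves an exploitable loophole.
```

The point of the sketch is only that the specified reward and the intended objective can agree on ordinary behaviour while diverging on a degenerate strategy the designer never intended, which is exactly the gap outer alignment is concerned with.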