AI ALIGNMENT FORUM: Tags

Language Models

• Applied to If language is for communication, what does that imply about LLMs? by Bill Benzon 2d ago
• Applied to Applying refusal-vector ablation to a Llama 3 70B agent by Vanessa Kosoy 3d ago
• Applied to Navigating LLM embedding spaces using archetype-based directions by mwatkins 6d ago
• Applied to Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant by Olli Järviniemi 8d ago
• Applied to On precise out-of-context steering by Olli Järviniemi 11d ago
• Applied to Mechanistically Eliciting Latent Behaviors in Language Models by Vanessa Kosoy 13d ago
• Applied to LLMs could be as conscious as human emulations, potentially by weightt an 14d ago
• Applied to An interesting mathematical model of how LLMs work by Bill Benzon 14d ago
• Applied to LLMs seem (relatively) safe by JustisMills 18d ago
• Applied to At last! ChatGPT does, shall we say, interesting imitations of “Kubla Khan” by Bill Benzon 20d ago
• Applied to How LLMs Work, in the Style of The Economist by Rocket Drew 22d ago
• Applied to What's up with all the non-Mormons? Weirdly specific universalities across LLMs by mwatkins 25d ago
• Applied to Inducing Unprompted Misalignment in LLMs by Sam Svenningsen 26d ago
• Applied to An examination of GPT-2's boring yet effective glitch by niplav 26d ago
• Applied to Claude 3 Opus can operate as a Turing machine by Gunnar Zarncke 1mo ago
• Applied to Experiments with an alternative method to promote sparsity in sparse autoencoders by Eoin Farrell 1mo ago
• Applied to Claude wants to be conscious by Joe Kwon 1mo ago
• Applied to Barcoding LLM Training Data Subsets. Anyone trying this for interpretability? by right..enough? 1mo ago
• Applied to Is LLM Translation Without Rosetta Stone possible? by cubefox 1mo ago