AI ALIGNMENT FORUM

Egg Syntax

AI safety & alignment researcher

In Rob Bensinger's typology: AGI-alarmed, tentative welfarist, and eventualist (or variabilist for sufficiently long values of 'soon').

I have signed no contracts or agreements whose existence I cannot mention.

Sequences

Posts


Wikitag Contributions

Comments

General Reasoning in LLMs

eggsyntax's Shortform (2y)
Good Research Takes are Not Sufficient for Good Strategic Takes
eggsyntax · 5mo

Strong upvote (both as object-level support and for setting a valuable precedent) for doing the quite difficult thing of saying "You should see me as less expert in some important areas than you currently do." 

Why imperfect adversarial robustness doesn't doom AI control
eggsyntax · 9mo

It seems like that assumption runs throughout the post, though -- e.g. 'But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said', and 'the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were.'

I don't just mean showing the model its past CoT. Current models are often given lots of information about their past actions and interactions so that they can produce better output; why would you expect that to change? Is the assumption that labs will be willing to accept worse performance because they recognize the model can't be trusted?

Why imperfect adversarial robustness doesn't doom AI control
eggsyntax · 9mo

the untrusted model is stateless between queries and only sees the command history and system state.

What justifies the assumption that untrusted models will be stateless in deployment? There are clear performance advantages in many cases to giving models memory, and in practice they already typically receive context containing the history of their actions or interactions.

If you're assuming that commercial labs will stop doing that for the sake of safety despite its advantages, it seems worth making that assumption explicit.

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions
eggsyntax · 1y

Minor suggestion: link to more info on LLC (or at least write it out as 'local learning coefficient') the first time it's mentioned -- I see that you do this further down, but it's confusing on a first read.

On the functional self of LLMs (2mo)
Numberwang: LLMs Doing Autonomous Research, and a Call for Input (7mo)
AIS terminology proposal: standardize terms for probability ranges (1y)
LLM Generality is a Timeline Crux (1y)
Language Models Model Us (1y)
Logical decision theories (3y, +5/-3)