Which of these theories:
Recent answers to this question vary widely, including the difficulty of representation learning (Liu et al., 2022), the scale of parameters at initialisation (Liu et al., 2023), spikes in loss ("slingshots") (Thilak et al., 2022), random walks among optimal solutions (Millidge et al., 2022), and the simplicity of the generalising solution (Nanda et al., 2023, Appendix E).
can predict the same "four novel predictions about grokking" that yours did? The relative likelihoods are what matter for updates, after all.
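(To spell out what I mean by that, this is just the standard Bayesian odds form, nothing specific to your setup: for your explanation versus some rival and an observed prediction-outcome $E$,

$$\frac{P(H_\text{yours} \mid E)}{P(H_\text{rival} \mid E)} \;=\; \frac{P(E \mid H_\text{yours})}{P(E \mid H_\text{rival})} \cdot \frac{P(H_\text{yours})}{P(H_\text{rival})},$$

so a confirmed prediction only favours your explanation to the extent that the rival theories assign that prediction a lower likelihood.)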
Also, how does this theory explain other grokking-related phenomena, e.g. Omni-Grok? And how do things change as you increase parameter count? Scale matters, and I am not sure whether things like ungrokking would vanish with scale the way catastrophic forgetting did, or what happens with the various inverse scaling phenomena.
ungrokking, in which a network regresses from perfect to low test accuracy
Is this the same thing as catastrophic forgetting?
That is a fantastic answer, thank you. Do you think that there's any way your post could be wrong? For instance, "[letting] decision trees be the main model of an agent's environment", as per JohnWentworth in a discussion with EJT[1], where he makes a critique similar to your point about their implicit dynamic choice principle?
See the comments section of this post: https://www.lesswrong.com/posts/bzmLC3J8PsknwRZbr/why-not-subagents
So where does JohnWentworth and David Lorell's argument fail? Why does it fail?
Section 3 begins with a conceptual argument suggesting that DSM-based choice under uncertainty will not, even behaviourally, effectively alter the agent’s preferences over time.
Hold up, does this mean you don't have a proof of whether agents who are Certain about their Tree get trammelled or not?
EDIT: Ah, I see you confirm that in section 3. Good to know.
EDIT: I meant what Wei Dai has been "talking about", not "trying to do".
This problem you are pointing at sounds like what Wei Dai has been trying to do for years. In some sense, it is like getting a fully specified meta-ethical framework, of the kind Eliezer attempted to describe in the Sequences. Does that sound right?
I expect that publishing would net decrease s-risks, not increase them. However
Yeah, I'd be interested in this, and will email you. That said, I'll just lay out my concerns here for posterity. What generated my question in the first place was thinking "what could possibly go wrong with publishing a reward function for social instincts?" My brain helpfully suggested that someone would use it to cognitively shape their AI in a half-assed manner because they thought the reward function was all they would need. Next thing you know, we're all living in super-hell[1].
You didn’t bring this up, but I think there’s a small but nonzero chance that the story of social instincts will wind up involving aspects that I don’t want to publish because of concerns about speeding timelines-to-AGI
Would you mind giving some hypothetical examples? This sounds plausible, but I'm struggling to think of concrete examples beyond vague thoughts like "maybe explaining social instincts involves describing a mechanism for sample-efficient learning".
Yes, that is an exaggeration, but I like the sentence.
Will you publish all the progress you make on decoding social instincts, or would that result in an unacceptable increase in s-risks and/or socially capable AI?
Isn't this the same as the "seamless transition for reward maximizers" technique described in section 5.1 of Stuart and Xavier's 2017 paper on utility indifference methods? It is a good idea, of course, and if you independently invented it, kudos, but it seems like something that already exists.
Implicitly, I thought that if you have a partial hypothesis of grokking, then it should be penalized if it shrugs at a grokking-related phenomenon. Unless by "shrugs" you mean the details of what the partial hypothesis says in this particular case are still being worked out. But in that case, confirming that the partial hypothesis doesn't yet say anything about some phenomenon is still useful info. I'm fairly sure this belief was what generated my question.
Thank you for going through the theories and checking what they have to say. That was helpful to me.
Do you have any plans to do this? How much time do you think it would take? And do you have any predictions for what should happen in these cases?