Chris_Leong

Comments


Acausal positive interpretation

I believe that Anthropic should be investigating artificial wisdom:

I've summarised a paper, with Yoshua Bengio as one of the authors, arguing for the importance of artificial wisdom.

I also have a short-form arguing for training wise AI advisors and an outline, Some Preliminary Notes of the Promise of a Wisdom Explosion.

Nice article; I especially love the diagrams!

In Human Researcher Obsolescence, you note that we can't completely hand over research unless we manage to produce agents that are at least as "wise" as the human developers.

I agree with this, though I would love to see a future version of this plan include an expanded analysis of the role that wise AI would play in Magma's strategy, as I believe this could be a key aspect of making the plan work.

In particular:

• We likely want to be developing wise AI advisors to advise us during the pre-hand-off period. In fact, I consider this likely to be vital to successfully navigating this period given the challenges involved.

• It's possible that we might manage to completely automate the more objective components of research without managing to completely automate the more subjective components. That said, we likely want to train wise AI advisors to help us with the more subjective components even if we can't defer to them.

• When developing AI capabilities, there's an additional lever in how much Magma focuses on direct capabilities vs. wisdom.

I agree that we probably want most theory to be towards the applied end these days due to short timelines. Empirical work needs theory to direct it; theory needs empirics to remain grounded.

Thanks for writing this. I think it is a useful model. However, there is one thing I want to push back against:

"Looking at behaviour is conceptually straightforward, and valuable, and being done"

I agree with Apollo Research that evals aren't really a science yet; they mostly seem to be conducted according to vibes. Model internals could help with this, but things like building up experience with auditing models under different schemes and comparing the results could also help make evals more scientific.

Similarly, a lot of the work on Model Organisms of Misalignment requires careful thought to get right.

"How can we get more evidence on whether scheming is plausible?" - What if we ran experiments where we included some pressure towards scheming (either RL or fine-tuning) and we attempted to determine the minimum such pressure required to cause scheming? We could further attempt to see how this interacts with scaling.

I'd strongly bet that when you break this down into more concrete detail, a flaw in your plan will emerge.

The balance of industries serving humans vs. AIs is a suspiciously high level of abstraction.

Thanks for posting this. I've been confused about the connection between shard theory and activation vectors for a long time!

"AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other"

This confuses me.

I can imagine an AIXI program where the utility function is compositional even if the optimisation is unitary. I guess this isn't two full motivational circuits, but it is, in a sense, two motivational circuits.
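Roughly (and abusing AIXI notation), I mean something like a reward that decomposes into two independently specifiable components $r^A$ and $r^B$, while the planning remains a single expectimax over the universal mixture $\xi$:

$$a_k = \arg\max_{a_k} \sum_{x_k} \;\cdots\; \max_{a_m} \sum_{x_m} \Big[ \big(r^A_k + r^B_k\big) + \cdots + \big(r^A_m + r^B_m\big) \Big] \, \xi\big(x_{k:m} \mid x_{<k}, a_{1:m}\big)$$

The two reward components play the role of two motivational circuits, but there is only one optimisation process acting on their sum.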

I would suggest having 50% of researchers work on a broader definition of control: "control" proper, technical governance work, and technical outreach (scary demos, model organisms of misalignment).

I’m confused by your use of Shapley values. Shapley values assume that the “coalition” can form in any order, but that doesn’t seem like a good fit for language models, where order is important.
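For reference, the Shapley value averages each player's marginal contribution over all $n!$ orderings of the players, which is exactly the order-indifference I'm questioning:

$$\phi_i(v) = \frac{1}{n!} \sum_{\pi \in \Pi(N)} \Big[ v\big(P_i^{\pi} \cup \{i\}\big) - v\big(P_i^{\pi}\big) \Big]$$

where $\Pi(N)$ is the set of orderings of the $n$ players and $P_i^{\pi}$ is the set of players preceding $i$ in ordering $\pi$. For language models, where order matters, most of these orderings don't seem meaningful.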
