User Comment Replies — AI Alignment Forum

Things I'm confused about:

How can the mechanism by which the model outputs ‘true’ representations of its processing be verified?

Re ‘translation mechanism’: How could a model use language to describe its processing if that processing includes novel concepts, mechanisms, or objects for which there are no existing examples in human-written text? Can a model fully know what it is doing?

Supposing an AI were capable of at least explaining around or gesturing towards this processing in a meaningful way, would humans be able to interpret these explanations sufficiently such th...

All of sage_bergerson's Comments + Replies