Jemal Young

I specialize in regulatory affairs for AI-enabled Software as a Medical Device and hope to work in AI risk mitigation.

Comments

Discovering Language Model Behaviors with Model-Written Evaluations

Jemal Young · 2y · 45

How do we know that an LM's natural language responses can be interpreted literally? For example, if given a choice between "I'm okay with being turned off" and "I'm not okay with being turned off", and the model chooses either alternative, how do we know that it understands what its choice means? How do we know that it has expressed a preference, and not simply made a prediction about what the "correct" choice is?