Kudos for releasing a concept of a plan! Some thoughts:
Regarding the first safety case:
- The first safety case seems to me to assume an overly optimistic amount of progress in mech interp; I basically think that most of the "limitations" are in fact pretty serious. However, I appreciate the attempt to include concrete requirements.
- I believe that getting good results from the following experiments might be particularly unrealistic:
- "In order to robustify our evals against sandbagging, we ran an experiment where we steered one or more trut