I expect there to be too much happenstance encoded in my values.
I believe this is a bug, not a feature that we would like to reproduce.
I think that the direction you described with the AI analysing how you acquired your values is important, because it shouldn't be mimicking just your current values. It should be able to adapt the values to new situations the way you'd do (distributional shift). Think all the books / movies where people get to unusual situations and have to make tough moral calls. Like plane crashing in the middle of nowhere with 20 surv...
Sure. My best guess for how we will train AGI is via giving it access to huge amounts of data—e.g. every outbound Reddit link as in GPT-2. Given that sort of a massive, rich dataset, I think that for a model to determine the particular training procedure that it's undergoing won't be very difficult. If it just reads the Wikipedia pages on ML, gradient descent, etc. and some arXiv papers on transparency techniques, for example, then that should give it everything it needs to know.
Re "Prompt Injection, by Derik Kauffman, Aaron Kirtland, Andrew Gritsevskiy, and Joe Cavanagh (Third Prize)": text-davinci-003 works perfectly if the last input's first letter is lower case. It correctly predicts "Input" as the first token with >99% prob.
It seems that the last input being out-of-distribution wrt. to the few shot examples makes the model default to its priors and follow the instruction.
Also, interestingly, replacing the "following sentences" in the instruction with "following inputs" also fixes the model's behavior. I felt the term "sent... (read more)