Your making a mistake : "If you ask it for a plan to manufacture paperclips, it doesn't think the best plan would involve converting all the matter in the solar system into paperclips. If you ask for a plan to eliminate cancer, it doesn't think the extermination of all biological life would count as a success." When you ask GPT for such things, you're not setting it's internal goal. You're just interracting with him. His internal goal might look like "Provide answer that are likely to satisfy the human , and write only potato if the human provide the exact string "s8r5d6s8rd4c8e" " You have no idea what's exactly it internal goal, but you're just seeing that so far its providing satisfying answers. Such misalligned internal goals can still bite us especially if the there are specific outliers not present in the training dataset
Your making a mistake : "If you ask it for a plan to manufacture paperclips, it doesn't think the best plan would involve converting all the matter in the solar system into paperclips. If you ask for a plan to eliminate cancer, it doesn't think the extermination of all biological life would count as a success."
When you ask GPT for such things, you're not setting it's internal goal. You're just interracting with him. His internal goal might look like "Provide answer that are likely to satisfy the human , and write only potato if the human provide the exact string "s8r5d6s8rd4c8e" "
You have no idea what's exactly it internal goal, but you're just seeing that so far its providing satisfying answers.
Such misalligned internal goals can still bite us especially if the there are specific outliers not present in the training dataset