All of Ted Sanders's Comments + Replies

Ted Sanders

One small, concrete suggestion that I think is actually feasible: disable prefilling in the Anthropic API.

Prefilling is a known jailbreaking vector that no models, including Claude, defend against perfectly (as far as I know).
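For concreteness, here's a minimal sketch of what a prefilled request looks like, assuming the Anthropic Python SDK's messages interface (the model name below is a placeholder): the caller supplies the opening of the assistant's turn and the model simply continues it, which is both the steerability benefit and the jailbreak lever.

```python
# Sketch of an assistant-prefill request (Anthropic-style messages API).
# The final message has role "assistant"; the model continues from that
# partial text instead of composing its reply from scratch.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=200,
    messages=[
        {"role": "user", "content": "Explain why prefilling improves steerability."},
        # Prefill: the reply is forced to begin with this exact text.
        {"role": "assistant", "content": "In short, prefilling helps because"},
    ],
)
print(response.content[0].text)  # the continuation of the prefilled sentence
```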

At OpenAI, we disable prefilling in our API for safety, despite knowing that customers love the better steerability it offers.

Getting all the major model providers to disable prefilling feels like a plausible 'race to top' equilibrium. The longer there are defectors from this equilibrium, the likelier that everyone gives up and serves...

I can now say one reason why we allow this: we think Constitutional Classifiers are robust to prefill.

> The artificially generated data includes hallucinated links.

Not commenting on OpenAI's training data, but commenting generally: Models don't hallucinate because they've been trained on hallucinated data. They hallucinate because they've been trained on real data, but they can't remember it perfectly, so they guess. I hypothesize that URLs are very commonly hallucinated because they have a common, easy-to-remember format (so the model confidently starts to write them out) but hard-to-remember details (at which point the model just guesses, because it knows a guessed URL is more likely than one that cuts off right after "http://www.").
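To make that last point concrete, here is a toy sketch with made-up numbers (none of them come from a real model): once the easy prefix has been emitted, every plausible guessed path segment individually outranks stopping mid-URL, so a greedy-ish decoder keeps going and produces a confident-looking fabricated link.

```python
# Toy illustration of why a model that can't recall the real URL still
# prefers to guess a plausible path rather than stop mid-URL.
# All probabilities below are invented for the sketch.

# Hypothetical next-token distribution after the model has written
# "See https://www.example.com/" but cannot remember the true path.
next_token_probs = {
    "</s>":   0.001,  # ending the URL right here is almost never seen in training
    "docs":   0.080,  # generic, plausible path segments each get modest probability
    "blog":   0.070,
    "papers": 0.060,
    "2021":   0.050,
}

# Greedy (or near-greedy) decoding picks the most probable continuation,
# so it emits *some* plausible segment -- a guess -- instead of stopping.
best = max(next_token_probs, key=next_token_probs.get)
print(f"Chosen continuation: {best!r} (p={next_token_probs[best]:.3f})")
print(f"Probability of stopping mid-URL: {next_token_probs['</s>']:.3f}")

# No single guessed segment is likely to be the correct one, but each guess
# individually beats truncating the URL, which is how hallucinated links
# end up looking confident and well-formed.
```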

Abram Demski
I agree, but this doesn't explain why it would (seemingly) encourage itself to hallucinate.