Ted Sanders - AI Alignment Forum

One small, concrete suggestion that I think is actually feasible: disable prefilling in the Anthropic API.

Prefilling is a known jailbreaking vector that no models, including Claude, defend against perfectly (as far as I know).

At OpenAI, we disable prefilling in our API for safety, despite knowing that customers love the better steerability it offers.

Getting all the major model providers to disable prefilling feels like a plausible 'race to top' equilibrium. The longer there are defectors from this equilibrium, the likelier that everyone gives up and serves models in less safe configurations.

Just my opinion, though. Very open to the counterargument that prefilling doesn't meaningfully extend potential harms versus non-prefill jailbreaks.

(Edit: To those voting disagree, I'm curious why. Happy to update if I'm missing something.)

evhub's Shortform

Ted Sanders3mo*13-6

Why is o1 so deceptive?

Ted Sanders6mo*62

>The artificially generated data includes hallucinated links.

Not commenting on OpenAI's training data, but commenting generally: Models don't hallucinate because they've been trained on hallucinated data. They hallucinate because they've been trained on real data, but they can't remember it perfectly, so they guess. I hypothesize that URLs are very commonly hallucinated because they have a common, easy-to-remember format (so the model confidently starts to write them out) but hard-to-remember details (at which point the model just guesses because it knows a guessed URL is more likely than a URL that randomly cuts off after the http://www.).

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments