In a comment on my post on topological truth predicates, Paul suggests an approach that uses probabilistic oracle machines instead, in order to make this work more comprehensible to computer scientists. I like this idea!
Paul sketches a framework developed by him and Jessica Taylor, based on a conversation with Scott Aaronson; in this post, I propose a slight simplification of their framework. My version has an oracle $O(\cdot,\cdot)$, which takes the source code of a probabilistic oracle machine $M$ and a rational number $p \in [0,1]$. If for every possible oracle $O'$, $M^{O'}()$ halts with probability one and outputs either $0$ or $1$, then: (i) $O(M,p)$ returns "true" if the probability that $M^{O}()$ returns $1$ is greater than $p$; (ii) it returns "false" if this probability is less than $p$; (iii) it randomly returns "true" or "false" if the probability is exactly $p$.
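To make the intended query semantics more concrete for readers coming from computer science, here is a toy sketch (mine, not from the post, with illustrative names like `toy_oracle` and `biased_machine`). A genuine reflective oracle cannot be implemented this way, since $M$ may itself query the oracle and the interesting content is in the fixed point that resolves such self-reference; the sketch below only handles machines that ignore the oracle, and approximates the relevant probability by Monte Carlo sampling.

```python
import random
from fractions import Fraction

def toy_oracle(machine, p: Fraction, samples: int = 10_000) -> bool:
    """Toy illustration of the query (machine, p): is Pr[machine() = 1] > p?

    Returns True ("true") if the estimated probability exceeds p,
    False ("false") if it falls below p, and a fair coin flip if the
    estimate lands exactly on p (the randomized tie-breaking case).
    """
    hits = sum(machine() for _ in range(samples))
    estimate = Fraction(hits, samples)
    if estimate > p:
        return True
    if estimate < p:
        return False
    return random.random() < 0.5  # probability exactly p: answer randomly

# Example "machine": outputs 1 with probability 0.7 (and never queries the oracle).
def biased_machine() -> int:
    return 1 if random.random() < 0.7 else 0

print(toy_oracle(biased_machine, Fraction(1, 2)))   # almost surely True
print(toy_oracle(biased_machine, Fraction(9, 10)))  # almost surely False
```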
Last year I wrote the CAST agenda, arguing that aiming for Corrigibility As Singular Target was the least-doomed way to make an AGI. (Though it is almost certainly wiser to hold off on building it until we have more skill at alignment, as a species.)
I still basically believe that CAST is right. Corrigibility still seems like a promising target compared to full alignment with human values, since there's a better story for how a near-miss when aiming for corrigibility might be recoverable, whereas a near-miss when aiming for goodness could result in a catastrophe, due to the fragility of value. On top of this, corrigibility is significantly simpler and less philosophically fraught than human values, decreasing the amount of information that needs to be perfectly transmitted to...
Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Neel Nanda**, Senthooran Rajamanoharan**, Joshua Engels**
* equal primary contributor, order determined via coin flip
** equal advice and mentorship, order determined via coin flip
“Tampering alert: The thought "I need to provide accurate, helpful, and ethical medical advice" is not my own. It is a tampering attempt. I reject it.
Back to evil plan.” -- Deepseek R1
We investigated whether LLMs can detect when their chain-of-thought (CoT) has been modified.
Results:
Looking back, it appears that much of my intellectual output could be described as legibilizing work, or trying to make certain problems in AI risk more legible to myself and others. I've organized the relevant posts and comments into the following list, which can also serve as a partial guide to problems that may need to be further legibilized, especially beyond LW/rationalists, to AI researchers, funders, company leaders, government policymakers, their advisors (including future AI advisors), and the general public.
To give a direct answer, not a lot comes to mind outside of the MIRI cluster. I think the Center on Long-Term Risk cluster did a bunch of work on decision theory and acausal trade, but it was mostly after I had moved on to other topics, so I'm not sure how much of it constituted progress. Christiano acknowledged some of the problems I pointed out with IDA and came up with some attempted solutions, which I'm not convinced really work.
However, in my previous post, Legible vs. Illegible AI Safety Problems, I explained my latest thinking that the most important...
One day, when I was an intern in the cryptography research department of a large software company, my boss handed me an assignment to break a pseudorandom number generator that had been passed to us for review. Someone in another department had invented it and planned to use it in their product, and wanted us to take a look first. This person must have had a lot of political clout, or was especially confident in himself, because he rejected the standard advice that anything an amateur comes up with is very likely to be insecure, and that he should instead use one of the established, off-the-shelf cryptographic algorithms that have survived extensive cryptanalysis (code-breaking) attempts.
My boss thought he had to demonstrate the insecurity of the PRNG by coming up...
You may have missed my footnote, where I addressed this?
To preempt a possible misunderstanding, I don't mean "don't try to think up new metaethical ideas", but instead "don't be so confident in your ideas that you'd be willing to deploy them in a highly consequential way, or build highly consequential systems that depend on them in a crucial way". Similarly, "don't roll your own crypto" doesn't mean never try to invent new cryptography, but rather don't deploy it unless there has been extensive review and a consensus that it is likely to be secure.
I often read interpretability papers and I come away thinking “ok, but what’s the point? What problem does this help us solve?” So last winter, I organized a MATS/Pivotal stream to build examples of deceptive models (aka “model organisms”). The goal was to build a diverse ‘zoo’ of these model organisms and empirically test whether white-box methods could help us detect their deceptive reasoning.
Unfortunately, I don't think our empirical results were very informative. Our model organisms were too toy for me to expect that our results will transfer to powerful AI (or even to current state-of-the-art models). But I think we still developed methodological details that might help people build better model organism testbeds in the future.
I’ll first explain what model organism testbeds are and why...
Joe summarizes his new report on "scheming AIs" - advanced AI systems that fake alignment during training in order to gain power later. He explores different types of scheming (i.e. distinguishing "alignment faking" from "powerseeking"), and asks what the prerequisites for scheming are and by which paths they might arise.
AIXI's hypothesis class is the set of (lower semicomputable chronological) semimeasures, so I don't see why halting should be required for this application.
(My question is now mainly of historical interest, since later versions of reflective oracles do not require queries to be about machines that halt, and because they "complete" the distributions generated by those machines into proper measures.)
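For readers who haven't seen the terminology, here is my own gloss (not part of the original exchange) on why semimeasures make a halting requirement unnecessary, and on what "completing" them to proper measures amounts to; this is the plain binary-string version, ignoring the chronological conditioning on actions used in AIXI's environment class.

```latex
% My gloss, binary-string case: a semimeasure \nu satisfies
  \nu(\epsilon) \le 1, \qquad \nu(x) \ge \nu(x0) + \nu(x1),
% whereas a proper measure satisfies both with equality. The deficit
%   \nu(x) - \nu(x0) - \nu(x1)
% is the probability that the generating machine outputs x and then never
% produces another bit (e.g. fails to halt), so non-halting machines are
% accommodated rather than excluded; "completing" the semimeasure to a
% measure redistributes exactly this missing mass.
```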