Having become frustrated with the state of the discourse about AI catastrophe, Zack M. Davis writes both sides of the debate, with back-and-forth takes between Simplicia and Doomimir that aim to spell out stronger arguments for each side.
I've typed up some math notes on how much MSE loss we should expect for random embeddings, and for some other alternative embeddings, when you have more features than neurons. I don't have a good sense of how legible this is to anyone but me.
Note that neither of these embeddings is optimal. I believe that the optimal embedding for minimising MSE loss is to store the features in almost orthogonal directions, which is similar to ran...
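For concreteness, here's a minimal numerical sketch of the kind of calculation involved, under my own simplifying assumptions (one-hot features, random unit-vector embeddings, and a plain linear readout W^T W x), which may not match the notes' exact setup:

```python
import numpy as np

# Sketch (my assumptions, not necessarily the notes' setup): n_features stored along
# random unit directions in an n_neurons-dimensional space, read out linearly as
# W^T W x, with exactly one feature active at a time.
rng = np.random.default_rng(0)
n_features, n_neurons = 400, 50

W = rng.normal(size=(n_neurons, n_features))
W /= np.linalg.norm(W, axis=0, keepdims=True)   # each feature -> a random unit vector

X = np.eye(n_features)                 # one-hot inputs: feature i active on its own
X_hat = W.T @ (W @ X)                  # embed, then linear readout
empirical_mse = np.mean((X_hat - X) ** 2)

# For independent random unit vectors, E[(w_i . w_j)^2] is about 1/n_neurons, so the
# per-coordinate MSE should be roughly (n_features - 1) / (n_neurons * n_features).
predicted_mse = (n_features - 1) / (n_neurons * n_features)
print(f"empirical: {empirical_mse:.4f}   rough prediction: {predicted_mse:.4f}")
```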
This post is a comparison of various existing decision theories, with a focus on decision theories that use logical counterfactuals (a.k.a. the kind of decision theories most discussed on LessWrong). The post compares the decision theories along outermost iteration (action vs policy vs algorithm), updatelessness (updateless or updateful), and type of counterfactual used (causal, conditional, logical). It then explains the decision theories in more detail, in particular giving an expected utility formula for each. The post then gives examples of specific existing decision problems where the decision theories give different answers.
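As a rough schematic of how the counterfactual types cash out in expected-utility formulas (this is the standard presentation, not necessarily the post's exact notation):

$$EU_{\mathrm{CDT}}(a) = \sum_{o} P(o \mid \mathrm{do}(A = a))\, U(o), \qquad EU_{\mathrm{EDT}}(a) = \sum_{o} P(o \mid A = a)\, U(o),$$

$$EU_{\mathrm{FDT}}(a) = \sum_{o} P\big(o \mid \ulcorner \mathrm{FDT}(P, x) = a \urcorner\big)\, U(o),$$

where the FDT formula conditions on the logical claim that the agent's decision algorithm, run on its situation, outputs $a$.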
There are some other comparisons of decision theories (see the “Other comparisons” section), but they either (1) don’t focus on logical-counterfactual decision theories; or (2) are outdated (written before the new functional/logical decision theory terminology came about).
To give a more personal motivation, after reading through a bunch
Hey, I'm interested in implementing some of these decision theories (and decision problems) in code. I have an initial version of CDT, EDT, and something I'm generically calling "FDT" (though I guess it's actually some particular sub-variant of FDT) in Python here, with the core decision theories implemented in about 45 lines of Python code here. I'm wondering if anyone here has suggestions on what it would look like to implement UDT in this framework -- either 1.0 or 1.1. I don't yet have a notion of "observation" in the code, so I can't yet impleme...
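In case it helps with comparing notes, here's a rough self-contained toy along the same lines (my own sketch, not the linked 45-line implementation): CDT and EDT as expected-utility maximizers on Newcomb's problem, where the "decision theory" is just the choice of which distribution over predictions gets used.

```python
"""Toy sketch: CDT vs EDT on Newcomb's problem.

Assumes a 99%-accurate predictor and the standard payoffs. EDT conditions the
prediction on the action taken; CDT treats the already-made prediction as causally
fixed, so it needs some prior over predictions (0.5 here, purely for illustration).
"""

PAYOFF = {  # (prediction, action) -> payout in dollars
    ("one-box", "one-box"): 1_000_000,
    ("one-box", "two-box"): 1_001_000,
    ("two-box", "one-box"): 0,
    ("two-box", "two-box"): 1_000,
}
ACCURACY = 0.99
ACTIONS = ["one-box", "two-box"]

def other(action):
    return "two-box" if action == "one-box" else "one-box"

def edt_eu(action):
    # EDT: P(prediction | action) reflects the predictor's accuracy.
    return (ACCURACY * PAYOFF[(action, action)]
            + (1 - ACCURACY) * PAYOFF[(other(action), action)])

def cdt_eu(action, p_predicted_one_box=0.5):
    # CDT: the prediction is causally upstream of the action, so its distribution
    # doesn't change with the action; two-boxing dominates for any prior.
    return (p_predicted_one_box * PAYOFF[("one-box", action)]
            + (1 - p_predicted_one_box) * PAYOFF[("two-box", action)])

print("EDT chooses:", max(ACTIONS, key=edt_eu))   # one-box
print("CDT chooses:", max(ACTIONS, key=cdt_eu))   # two-box
```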
Some of my friends are signal-boosting this new article: 60 U.K. Lawmakers Accuse Google of Breaking AI Safety Pledge. See also the open letter. I don't feel good about this critique or the implicit ask.
Have you stated anywhere what makes you think "apparently a village idiot" is a sensible description of current learning programs, as they inform us regarding the question of whether or not we currently have something that is capable via generators sufficiently similar to [the generators of humanity's world-affecting capability] that we can reasonably induce that these systems are somewhat likely to kill everyone soon?
Here's a relatively important question regarding transparency requirements for AI companies: At which points in time should AI companies be required to disclose information? (While I focus on transparency, this question is also applicable to other safety-relevant requirements, and is applicable to norms around voluntary actions rather than requirements.)
A natural option would be to attach transparency requirements to the existing processes of pre-deployment testing and releasing a model card when a new model is released. As in, companies would be required to include the relevant information whenever they release a new model (likely in the model card). This is convenient because pre-deployment testing and model cards are already established norms in the AI industry, which makes it easier to attach something new to these existing processes rather...
Another downside is that pre-deployment risk assessments might increase the likelihood of a secret intelligence explosion by discouraging public release of models.
I was amused when Claude Opus abruptly stopped generating a reply and shut down the chat after I asked it how a fictional galactic empire might control its frontier planets. Given that it stopped generating in the middle of a sentence about "biological monitoring" and "enhanced", I surmised that the reference to the genetically engineered catboys/catgirls in the setting had triggered its bioengineering filters.
There are two ways to show that an AI system is safe: show that it doesn't have dangerous capabilities, or show that it's safe even if it has dangerous capabilities. Until three months ago, AI companies said their models didn't have dangerous capabilities. (At the time, I wrote that the companies' eval reports didn't support their claims that their models lacked dangerous bio capabilities.) Now, Anthropic, OpenAI, Google DeepMind, and xAI say their most powerful models might have dangerous biology capabilities and thus could substantially boost extremists (but not states) in creating bioweapons. To prevent such misuse, they must (1) prevent extremists from misusing the models via the API and (2) prevent extremists from acquiring the model weights.[1] For (1), they claim classifiers block bad model outputs; for (2), they claim their security prevents extremists...