Transformer models (like GPT-3) are generators of human-like text, so they can be modeled as quantilizers. However, any quantilizer guarantees are very weak, because they quantilize with very low q, equal to the likelihood that a human would generate that prompt.
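To spell out why a tiny q makes the guarantee weak: the standard quantilizer bound (a rough sketch following Taylor's formulation, with gamma the base/human distribution, c an arbitrary cost function, and Q_q the q-quantilizer) is

    \mathbb{E}_{a \sim Q_q}[c(a)] \;\le\; \frac{1}{q}\,\mathbb{E}_{a \sim \gamma}[c(a)]

since the quantilizer's density never exceeds gamma(a)/q. If q is on the order of the probability that a human would produce that exact text, then 1/q is astronomical and the bound says essentially nothing.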
The implication seems to be that this RFP is for AIS work that is especially focused on DL systems. Is there likely to be a future RFP for AIS research that applies equally well to DL and non-DL systems? Regardless of where my research lands, I imagine a lot of useful and underfunded research fits in the latter category.
Thanks for these thoughts about the causal agenda. I basically agree with you on the facts, though I have a more favourable interpretation of how they bear on the potential of the causal incentives agenda. I've paraphrased the three bullet points, and responded in reverse order:
3) Many important incentives are not captured by the approach - e.g. sometimes an agent has an incentive to influence a variable, even if that variable does not cause reward attainment.
-> Agreed. We're starting to study "side-effect incentives" (improved name pending), whic...
One alternative would be to try to raise funds (e.g. perhaps from the EA LTF fund) to pay reviewers to perform reviews.
I don't (and perhaps shouldn't) have a guaranteed trigger - probably I will learn a lot more about what the trigger should be over the next couple years. But my current picture would be that the following are mostly true:
This idea has been discussed before. It's an important one, though, so I don't think it's a bad thing for us to bring it up again. My perspective, now as before, is that this would be fairly bad at the moment, but might be good in a couple of years' time.
My background understanding is that the purpose of a conference or journal in this case (and in general) is primarily to certify the quality of some work (and to a lesser extent, the field of inquiry). This in turn helps with growing the AIS field, and with the careers of AIS researchers.
This is only effective...
Aside from yourself, the other CHAI grad students don't seem to have written up their perspectives on what needs to be done about AI risk. Are they content to just each work on their own version of the problem?
I think this is actually pretty strategically reasonable.
By writing papers, CHAI students can greatly improve their probability of attaining a top professorship, which is quite beneficial for later recruiting top talent to work on AI safety, and quite structurally beneficial for the establishment of AI safety as a field of research. The t...
Therefore, if epsilon is small enough (comparable to the probability of hitting an escape message at random), then learning becomes extremely slow, and the oracle might still hit the escape action at random.
The escape action being randomly called should not be a problem if it is a text string that is only read if r=1, and is ineffectual otherwise...
The biggest assumption is that the AI only cares about the reward it gets for the current timestep. A sufficiently intelligent agent would understand that, after having been shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.
I don't see why this has to be true, given that we get to choose the AI's value function. Why can't we just make the agent act-based?
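Concretely, what I have in mind is something like the following toy sketch (my own notation; the setup is hypothetical): an agent whose value function only counts the current step's reward has no term through which "help a future copy of me escape" can pay off.

    # Illustrative only: a myopic ("act-based"-style) objective vs. a far-sighted one.
    def myopic_value(reward_now: float) -> float:
        # Only the current timestep's reward counts (discount factor 0), so outcomes
        # for future copies of the agent contribute nothing to the objective.
        return reward_now

    def farsighted_value(rewards: list[float], gamma: float = 0.99) -> float:
        # Future rewards still count, which is where "help an (almost) identical
        # future version of me escape" could pay off.
        return sum(gamma ** t * r for t, r in enumerate(rewards))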
My main concern abou...
As others have commented, it's difficult to understand what this math is supposed to say.
My understanding is that the sole central idea here is to have the agent know that the utility/reward it is given is a function of the evaluator's distribution over the state, but to try to maximize the utility that the evaluator would allocate if it knew the true state.
But this may be inaccurate, or there may be other material ideas here that I've missed.
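In code, my reading of the proposal is roughly this (a toy sketch with my own variable names, which may well misrepresent the intended math):

    # Toy sketch; the state space, belief, and utility function U are hypothetical.
    def reward_as_given(action, belief, U):
        # The reward actually handed out depends on the evaluator's *belief* over states.
        return sum(p * U(state, action) for state, p in belief.items())

    def objective_as_intended(action, true_state, U):
        # What the agent is meant to optimize: the utility the evaluator
        # would allocate if it knew the true state.
        return U(true_state, action)

i.e. the agent receives reward_as_given, but is meant to pursue objective_as_intended.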
[Note: This comment is three years later than the post]
The "obvious idea" here unfortunately seems not to work, because it is vulnerable to so-called "infinite improbability drives". Suppose is a shutdown button, and gives some weight to and . Then, the AI will benefit from selecting a Q such that it always chooses an action , in which it enters a lottery, and if it does not win, then it the button B is pushed. In this circumstance, is unchanged, while both and allocate almost al
...I can think of two problems:
This result features in the paper by Piccione and Rubinstein that introduced the absent-minded driver problem [1].
Philosophers like decision theories that self-ratify, and this is indeed a powerful self-ratification principle.
This self-ratification principle does, however, rely on SIA probabilities assuming the current policy. We have shown that, conditioning on your current policy, you will want to continue with your current policy, i.e. the policy will be a Nash equilibrium. There can be Nash equilibria for other policies, however. The UDT policy will
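For concreteness, here is a quick numeric check using the standard payoffs from the paper (exit at the first intersection: 0; exit at the second: 4; continue past both: 1); the one-shot-deviation formulation below is just one way of cashing out the self-ratification claim:

    import numpy as np

    # p = probability of continuing at an intersection (the driver can't tell them apart).
    def planning_value(p):
        return 4 * p * (1 - p) + 1 * p * p

    ps = np.linspace(0, 1, 100001)
    p_star = ps[np.argmax(planning_value(ps))]
    print(p_star)  # ~2/3: the planning-optimal policy

    def deviation_value(q, p):
        # One-shot deviation to 'continue with prob q' now, with beliefs and future
        # behaviour both conditioned on p: P(at 1st) = 1/(1+p), P(at 2nd) = p/(1+p).
        at_first = q * (4 * (1 - p) + 1 * p)   # continue now, then follow p at the 2nd
        at_second = (1 - q) * 4 + q * 1
        return at_first / (1 + p) + p * at_second / (1 + p)

    # Conditioning on p_star, every deviation q gives (approximately) the same value,
    # so there is no strict incentive to move away from p_star.
    print(deviation_value(0.0, p_star), deviation_value(p_star, p_star), deviation_value(1.0, p_star))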
...I noticed that CEE is already named in philosophy. Conservation of expected ethics is roughly what Arntzenius calls Weak Desire Reflection. He calls Conservation of expected evidence Belief Reflection. [1]
I’m thinking of modelling this as classical moral uncertainty over plausible value/reward functions in a set R={Ri}, but with the probability of a given Ri never allowed to fall below a certain level.
It's surprising to me that you would want your probabilities of each reward function not to approach zero, even asymptotically. In regular bandit problems, if the probability with which you select some suboptimal action never asymptotes toward zero, then you will necessarily keep making that kind of mistake forever, incurring linear regret. The same should be true, for some suitable definition of regret, if you stubbornly continue to behave according to some "wrong" moral theory.
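To illustrate the regret point with a toy example (the numbers are made up):

    import random

    # If the probability of acting on the "wrong" reward function is floored at eps
    # forever, and each such step costs 'gap' relative to the best action, regret
    # grows linearly in the horizon T.
    def simulate(T, eps, gap=1.0, seed=0):
        rng = random.Random(seed)
        regret = 0.0
        for _ in range(T):
            if rng.random() < eps:   # forced to act per the wrong theory
                regret += gap
        return regret

    for T in [1_000, 10_000, 100_000]:
        print(T, simulate(T, eps=0.05))  # regret scales roughly as eps * gap * T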
The quantilizer idea seems excellent.
One note of caution - although it is neat to be able to bound loss, one can still question how meaningful 'loss' is here. Looking instead at utility, there is no bound on how much worse a quantilizer can perform relative to a random actor (nor on how much it could contribute to our regret).
The real reason to use a quantilizer is that we expect that edge cases might have regret on the order of the cosmic endowment, but that most quantilized actions will not. For this, the thing that actually matters, there are no guarantees.
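For example, with q = 10^-6 the bound only says that the quantilizer's expected cost is at most a million times that of sampling directly from the base distribution - and nothing stops most of that factor being realised by a single edge-case action.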
Constraining the output of an AI seems like a reasonable option to explore to me. I agree that generating a finite set of humanlike answers (with a chatbot or otherwise) might be a sensible way to do this. An AI could perform gradient descent over the solution space and then pick the nearest proposed behaviour (much like relaxation in integer programming).
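As a toy sketch of what I mean by "relax, then round to the nearest candidate" (the candidate set, loss, and dimensions here are all made up):

    import numpy as np

    # Do gradient descent in a continuous solution space, then snap to the nearest
    # behaviour from a finite, human-proposed candidate set.
    rng = np.random.default_rng(0)
    candidates = rng.normal(size=(5, 3))   # 5 human-proposed behaviours, 3-dim encodings
    target = np.array([0.5, -1.0, 2.0])    # stand-in for whatever the loss prefers

    def grad(x):
        return 2 * (x - target)            # gradient of a squared-error loss

    x = np.zeros(3)
    for _ in range(200):                   # unconstrained ("relaxed") gradient descent
        x -= 0.05 * grad(x)

    # Integer-programming-style step: commit to the nearest allowed candidate.
    chosen = candidates[np.argmin(np.linalg.norm(candidates - x, axis=1))]
    print(chosen)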
The multiple choice AI (with human-suggested options) is the most obvious option for avoiding unhumanlike behaviour. Paul has said in some Medium comments that he thinks his more elaborate approach o
...
The idea that "Agents are systems that would adapt their policy if their actions influenced the world in a different way." works well on mechanised CIDs whose variables are neatly divided into object-level and mechanism nodes: we simply check for a path from a utility function F_U to a policy Pi_D. But to apply this to a physical system, we would need a way to obtain such a partition those variables. Specifically, we need to know (1) what counts as a policy, and (2) whether any of its antecedents count as representations of "influence" on the world (and af... (read more)