I wish I knew why.
Same.
I don't really have any coherent hypotheses for why this might be the case (not that I've tried for any fixed amount of time by the clock). I do, however, have a couple of vague suggestions for how one might gain slightly more information that could lead to a hypothesis, if you're interested.
The main one involves looking at the local nonlinearities of the few layers after the intervention layer at various inputs, by which I mean examining

diff(t) = f(input + t * top_right_vec) - f(input)

as a function of t (for small values of t).
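For concreteness, here's a minimal sketch of that probe in PyTorch, with a toy `f` standing in for the composition of the layers after the intervention layer and a random unit vector standing in for `top_right_vec`; all names, shapes, and layer choices here are placeholders, not anything from the actual model:

```python
import torch

# Toy stand-ins (hypothetical): `f` plays the role of the few layers after
# the intervention layer, `v` the steering vector top_right_vec.
torch.manual_seed(0)
d = 16
f = torch.nn.Sequential(
    torch.nn.Linear(d, d), torch.nn.GELU(),
    torch.nn.Linear(d, d), torch.nn.GELU(),
)
x = torch.randn(d)       # one input to probe around
v = torch.randn(d)
v = v / v.norm()         # unit-norm steering direction

def diff(t: float) -> torch.Tensor:
    """diff(t) = f(x + t*v) - f(x)."""
    with torch.no_grad():
        return f(x + t * v) - f(x)

# If f were locally linear around x, diff(t)/t would be constant in t.
# The relative deviation below therefore measures how nonlinear f is
# along the steering direction at this particular input.
ref = diff(1e-3) / 1e-3  # approximate directional derivative at x
for t in [1e-3, 1e-2, 1e-1, 0.5, 1.0]:
    dev = (diff(t) / t - ref).norm() / ref.norm()
    print(f"t={t:g}  relative deviation from linearity: {dev:.4f}")
```

Repeating this at many different inputs (with the real layers and vector swapped in for the toy ones) would give a picture of how the local nonlinearity varies across inputs, which is the extra information the suggestion is after.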
Thanks for the great post! I have a question, if it's not too much trouble:
Sorry for my confusion about something so silly, but shouldn't the following be "when "?
When there is no place where the derivative vanishes
I'm also a bit confused about why we can think of as representing "which moment of the interference distribution we care about."
Perhaps some of my confusion here stems from the fact that the optimal number of subspaces, , seems to me to be an increasing function of , which ...
Hmmm, I suspect that when most people say things like "the reward function should be a human-aligned objective," they're intending something more like "the reward function is one for which any reasonable learning process, given enough time/data, would converge to an agent that ends up with human-aligned objectives," or perhaps the far weaker claim that "the reward function is one for which there exists a reasonable learning process that, given enough time/data, will converge to an agent that ends up with human-aligned objectives."