User Comment Replies — AI Alignment Forum

I think that perhaps as a result of a balance of pros and cons, I initially was not very motivated to comment (and haven't been very motivated to engage much with ARC's recent work). But I decided maybe it's best to comment in a way that gives a better signal than silence.

I've generally been pretty confused about Formalizing the presumption of Independence and, as the post sort of implies, this is sort of the main advert that ARC have at the moment for the type of conceptual work that they are doing, so most of what I have to say is meta stuff ... (read more)

Paul Christiano2y60

I think this is a reasonable perception and opinion. We’ve written a little bit about how heuristic estimators might help with ELK (MAD and ELK and finding gliders), but that writing is not particularly clear and doesn’t present a complete picture.

We’ve mostly been focused on finding heuristic estimators, because I am fairly convinced they would be helpful and think that designing them is our key technical risk. But now that we are hiring again I think it’s important for us to explain publicly why they would be valuable, and to generally motivate and situa... (read more)

"Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability

carboniferous_umbraculum 2y22

I may come back to comment more or incorporate this post into something else I write but wanted to record my initial reaction which is that I basically believe the claim. I also think that the 'unrelated bonus reason' at the end is potentially important and probably deserves more thought.

0Arthur Conmy2y

Disclaimer: I work on interpretability at Redwood Research. I am also very interested in hearing a fleshed-out version of this critique. To me, this is related to the critique of Redwood's interpretability approach here, another example of "recruiting resources outside of the model alone". (however, it doesn't seem obvious to me that interpretability can't or won't work in such settings)

Behaviour Manifolds and the Hessian of the Total Loss - Notes and Criticism

carboniferous_umbraculum 3y21

I agree that the space $O^{D}$ may well miss important concepts and perspectives. As I say, it is not my suggestion to look at it, but rather just something that was implicitly being done in another post. The space $O^{R^{n}}$ may well be a more natural one. (It's of course the space of functions $R^{n} \to O$ , and so a space in which 'model space' naturally sits in some sense. )

Basin broadness depends on the size and number of orthogonal features

carboniferous_umbraculum 3y20

I wrote out the Hessian computation in a comment to one of Vivek's posts. I actually had a few concerns with his version and I could be wrong but I also think that there are some issues here. (My notation is slightly different because for me the sum over $x$ was included in the function I called " $l$ ", but it doesn't affect my main point).

I think the most concrete thing is that the function $f$ - i.e. the `input-output' function of a neural network - should in general have a vector output, but you write things like

\frac{d^{2} l}{d f^{2}}

witho... (read more)

Notes on Learning the Prior

carboniferous_umbraculum 3y42

Thanks very much Geoffrey; glad you liked the post. And thanks for the interesting extra remarks.

Information Loss --> Basin flatness

carboniferous_umbraculum 3y10

Thanks again for the reply.

In my notation, something like $\nabla l$ or $J f$ are functions in and of themselves. The function $\nabla l$ evaluates to zero at local minima of $l$ .

In my notation, there isn't any such thing as $\nabla_{f} l$ .

But look, I think that this is perhaps getting a little too bogged down for me to want to try to neatly resolve in the comment section, and I expect to be away from work for the next few days so may not check back for a while. Personally, I would just recommend going back and slowly going through the mathe... (read more)

Information Loss --> Basin flatness

carboniferous_umbraculum 3y20

Thanks for the substantive reply.

First some more specific/detailed comments: Regarding the relationship with the loss and with the Hessian of the loss, my concern sort of stems from the fact that the domains/codomains are different and so I think it deserves to be spelled out. The loss of a model with parameters $θ \in Θ$ can be described by introducing the actual function that maps the behavior to the real numbers, right? i.e. given some actual function $l : O^{k} \to R$ we have:

L : Θ f ⟶ O^{k} l ⟶ R

i.e. it's $l$ that might be something ... (read more)

1Vivek Hebbar3y

I will split this into a math reply, and a reply about the big picture / info loss interpretation. Math reply: Thanks for fleshing out the calculus rigorously; admittedly, I had not done this. Rather, I simply assumed MSE loss and proceeded largely through visual intuition. This is still false! Edit: I am now confused, I don't know if it is false or not. You are conflating ∇f l(f(θ)) and ∇θ l(f(θ)). Adding disambiguation, we have: ∇θ L(θ)=(∇f l(f(θ))) Jθf(θ) Hessθ(L)(θ)=Jθf(θ)T [Hessf(l)(f(θ))] Jθf(θ)+∇f l(f(θ)) D2θf(θ) So we see that the second term disappears if ∇f l(f(θ))=0. But the critical point condition is ∇θ l(f(θ))=0. From chain rule, we have: ∇θ l(f(θ))=(∇f l(f(θ))) Jθf(θ) So it is possible to have a local minimum where ∇f l(f(θ))≠0, if ∇f l(f(θ)) is in the left null-space of Jθf(θ). There is a nice qualitative interpretation as well, but I don't have energy/time to explain it. However, if we are at a perfect-behavior global minimum of a regression task, then ∇f l(f(θ)) is definitely zero. A few points about rank equality at a perfect-behavior global min: 1. rank(Hess(L))=rank(Jf) holds as long as Hess(l)(f(θ)) is a diagonal matrix. It need not be a multiple of the identity. 2. Hence, rank equality holds anytime the loss is a sum of functions s.t. each function only looks at a single component of the behavior. 3. If the network output is 1d (as assumed in the post), this just means that the loss is a sum over losses on individual inputs. 4. We can extend to larger outputs by having the behavior f be the flattened concatenation of outputs. The rank equality condition is still satisfied for MSE, Binary Cross Entropy, and Cross Entropy over a probability vector. It is not satisfied if we consider the behavior to be raw logits (before the softmax) and softmax+CrossEntropy as the loss function. But we can easily fix that by considering probability (after softmax) as behavior instead of raw logits.

Information Loss --> Basin flatness

carboniferous_umbraculum 3y50

This was pretty interesting and I like the general direction that the analysis goes in. I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.

I think that your setup is essentially that there is an $N$ -dimensional parameter space, let's call it $Θ$ say, and then for each element $x_{i}$ of the training set, we can consider the function $f_{i} : Θ ⟶ O$ ... (read more)

1Vivek Hebbar3y

Thanks for this reply, its quite helpful. Ah nice, didn't know what it was called / what field it's from. I should clarify that "key result" here just meant "key result of the math so far -- pay attention", not "key result of the whole post" or "profound/original". Yeah, you're right. Previously I thought G was the Jacobian, because I had the Jacobian transposed in my head. I only realized that G has a standard name fairly late (as I was writing the post I think), and decided to keep the non-standard notation since I was used to it, and just add a footnote. Yes; this is the whole point of the post. The math is just a preliminary to get there. Good catch -- it is technically possible at a local minimum, although probably extremely rare. At a global minimum of a regression task it is not possible, since there is only one behavior vector corresponding to zero loss. Note that behavior in this post was defined specifically on the training set. At global minima, "Rank(Hessian(Loss))=Rank(G)" should be true without exception. In "Flat basin ≈ Low-rank Hessian = Low-rank G ≈ High manifold dimension": The first "≈" is a correlation. The second "≈" is the implication "High manifold dimension => Low-rank G". (Based on what you pointed out, this only works at global minima). "Indicates" here should be taken as slightly softened from "implies", like "strongly suggests but can't be proven to imply". Can you think of plausible mechanisms for causing low rank G which don't involve information loss?

Intuitions about solving hard problems

carboniferous_umbraculum 3y00

I broadly agree with Richard's main point, but I also do agree with this comment in the sense that I am not confident that the example of Turing compared with e.g. Einstein is completely fair/accurate.

One thing I would say in response to your comment, Adam, is that I don't usually see the message of your linked post as being incompatible with Richard's main point. I think one usually does have or does need productive mistakes that don't necessarily or obviously look like they are robust partial progress. But still, often when there actually is a brea... (read more)

0Adam Shimi3y

Thanks for the answer. Hum, I'm not sure I'm following your point. Do you mean that you can have both productive mistakes and intuitively compelling explanations when the final (or even intermediary breakthrough) is reached? Then I totally agree. My point was more that if you only use Richard's heuristic, I expect you to not reach the breakthrough because you would have killed in the bud many productive mistakes that actually lead the way there. There's also a very kuhnian thing here that I didn't really mention in my previous comment (except on the Galileo part): the compellingness of an answer is often stronger after the fact, when you work in the paradigm that it lead too. That's another aspect of productive mistakes or even breakthrough: they don't necessarily look right or predict more from the start, and evaluating their consequences is not necessarily obvious.

Call For Distillers

carboniferous_umbraculum 3y40

I agree i.e. I also (fairly weakly) disagree with the value of thinking of 'distilling' as a separate thing. Part of me wants to conjecture that it's comes from thinking of alignment work predominantly as mathematics or a hard science in which the standard 'unit' is a an original theorem or original result which might be poorly written up but can't really be argued against much. But if we think of the area (I'm thinking predominantly about more conceptual/theoretical alignment) as a 'softer', messier, ongoing discourse full of different arguments fro... (read more)

[Closed] Job Offering: Help Communicate Infrabayesianism

carboniferous_umbraculum 3y00

It could also work here. But I do feel like pointing out that the bounty format has other drawbacks. Maybe it works better when you want a variety of bitesize contributions, like various different proposals? I probably wouldn't do work like Abram proposes - quite a long and difficult project, I expect - for the chance of winning a prize, particularly if the winner(s) were decided by someone's subjective judgement.

1Algon3y

My intuition is that you could break this task down into smaller chunks, like applications of Infra-bayes and musings on why Infra-bayes worked better than existing tools there (or worse!), which someone could do within a couple of weeks, and award bounties for those tasks. Then offer jobs to whomever seems like they could do good distillations. I think that for a few 100 hour tasks, you might need to offer maybe $50k-$100k dollars. That sounds crazy high? Well AI safety is talent constrained, it doesn't look like much is being done with the money, and MIRI seems to think there's a high discount rate (doom within a decade or two) so money should be spent now on tasks that seem important.

[Closed] Job Offering: Help Communicate Infrabayesianism

carboniferous_umbraculum 3y30

This post caught my eye as my background is in mathematics and I was, in the not-too-distant past, excited about the idea of rigorous mathematical AI alignment work. My mind is still open to such work but I'll be honest, I've since become a bit less excited than I was. In particular, I definitely "bounced off" the existing write-ups on Infrabayesianism and now without already knowing what it's all about, it's not clear it's worth one's time. So, at the risk of making a basic or even cynical point: The remuneration of the proposed job could be important for getting attention/ incentivising people on-the-fence.

0Algon3y

Offering a bounty on what you want seems sensible here. It seemed like it worked OK for ELK proposals, so why not here?

AI ALIGNMENT FORUM
AF

All of carboniferous_umbraculum 's Comments + Replies