This is great! We were working on very similar things concurrently at OpenAI but ended up going a slightly different route.

A few questions:
- What does the distribution of learned biases look like?
- For the STE variant, did you find it better to use the STE approximation for the activation gradient as well, even though the approximation is only needed for the bias gradient?