Posts

Sorted by New

Wiki Contributions

Comments

Sorted by

Romeo Dean and I ran a slightly modified version of this format for members of AISST and we found it a very useful and enjoyable activity!

We first gathered to do 2 hours of reading and discussing, and then we spent 4 hours switching between silent writing and discussing in small groups.

The main changes we made are:

  1. We removed the part where people estimate probabilities of ASI and doom happening by the end of each other’s scenarios.
  2. We added a formal benchmark forecasting part for 7 benchmarks using private Metaculus questions (forecasting values at Jan 31 2025):
    1. GPQA
    2. SWE-bench
    3. GAIA
    4. InterCode (Bash)
    5. WebArena
    6. Number of METR tasks completed
    7. ELO on LMSys arena relative to GPT-4-1106

We think the first change made it better, but in hindsight we would have reduced the number of benchmarks to around 3 (GPQA, SWE-bench and LMSys ELO), or given participants much more time.