(Update Jan. 12: We released an FAQ last month, with more details. Last updated Jan. 7.)

(Update Jan. 19: We now have an example of a successful partial run, which you can use to inform how you do your runs. Details.)

 

We at MIRI are soliciting help with an AI-alignment project centered around building a dataset, described below. We have $200,000 in prizes for building the first fragments of the dataset, plus an additional $1M prize/budget for anyone who demonstrates the ability to build a larger dataset at scale.

If this project goes well, then it may be the first of a series of prizes we offer for various projects.

Below, I’ll say more about the project, and about the payouts and interim support we’re offering.

 

The Project

Hypothesis: Language models can be made more understandable (and perhaps also more capable, though this is not the goal) by training them to produce visible thoughts

We’d like to test this hypothesis by fine-tuning/retraining a language model on a dataset composed of thought-annotated dungeon runs (in the manner of AI Dungeon).

A normal (un-annotated) dungeon run is a sequence of steps in which the player inputs text actions and the dungeon master responds with text describing what happened in the world as a result.

We’d like a collection of such runs, annotated with "visible thoughts" (visible to potential operators or programmers of the system, not to players) describing things like what just happened or is about to happen in the world, what sorts of things the player is probably paying attention to, where the current sources of plot tension are, and so on — the sorts of things a human author would think while acting as a dungeon master. (This is distinct from producing thoughts explaining what happened in the dungeon; “visible thoughts” are meant to play an active role in constructing the output.)

Once we have such a dataset, MIRI’s hope is that present or future technology will be able to train a model or models which iteratively produce visible thoughts along with storytelling, based on user actions plus previous history (including previous thoughts). The goal is to transition the state of AI dungeon technology from “An AI outputs story text in response to actions (and we have no idea how)” to “An AI produces thoughts as visible intermediates on the way to story text, allowing us to watch the AI think about how to design its output, and to verify that we can get different sensible outputs by intervening on the thoughts”. 

Here’s an example of the first couple of steps of a thought-annotated dungeon run (or “quest”), in the format MIRI currently thinks is worth trying. Some kinds of thoughts are marked with parentheses and/or brackets; see the next section for details on this.


  • Thoughts:
    • [The main plot:  Prota is a student at a magical academy that is under attack by zombies.]
    • [(This is a quest of medium length.)]
    • (This is the start of the story.  The Player should be given enough backstory and immediate description to have some sense of where they are, so the prompt can go on longer than usual.  The Player should be told who they are and their role in the universe.  The initial prompt should describe Prota being in a magical academy, and should describe the first surprising indication that the academy is under attack by zombies.)
  • Prompt: Today is your first day of classes at the Magical Academy. You arrived very late last night and missed the tour, but your roommate Lior promised he would help you find your way around. You wake up in the students’ dormitory at dawn, and follow Lior to breakfast. Just as you reach the dining hall, you hear a shout in the distance.
  • Thoughts:
    • [Prota missed the tour and doesn’t already know the way around.]
    • [Prota’s roommate’s name is Lior.]
    • [Lior is male.]
    • There are other people in the dining hall who will have also heard the shout, and will also be surprised and reacting to that.
  • Player action: What was that?
  • Thoughts: Prota is talking to Lior.
  • Outcome: “What was that?”, you say to Lior.

  • Thoughts:
    • Lior arrived earlier than Prota and had the tour of the Academy, since they knew the way to the dining hall. Lior will also have met the instructors and other students. Lior will know who is shouting and where they are shouting from.
    • (The event of hearing the shout is meant to introduce the opening premise of the story.)
    • The shout is by somebody who just discovered a zombie.
    • The shout is by another first-year student named Morven.
    • [Morven is a Fire mage.  Morven is aggressive.]
    • Morven will try to attack the zombie with fire.  This is happening where Prota and Lior cannot see it, only hear the shouts.
    • Lior knows that the shout is by Morven.  Lior knows Morven is a Fire mage.  Lior does not know Morven is attacking a zombie.  Lior does not know that Morven has good reason to be shouting.  Lior will worry that Morven is about to set something on fire.  Lior is on good terms with the protagonist and will speak their concerns honestly.
  • Prompt: “That shout sounded like Morven, a first-year Fire mage,” says Lior.  “I hope they aren’t about to set the Academy on fire.  We just got here.”
  • Thoughts: <none>
  • Player action: Should we go see what’s going on? Or is that too dangerous?
  • Thoughts: Prota is talking to Lior.
  • Outcome: You say to Lior, “Should we go see what’s going on?  Or is that too dangerous?”

A difficult first step in testing the hypothesis above is generating a sufficiently large dataset (suitable for language model retraining) of thought-annotated dungeon runs. This likely requires at least a moderate degree of introspective and authorial skill from the people creating the dataset. See this sample of a partial run to get a further sense of what we are looking for. More detail on the type of thing we’re looking for can hopefully be inferred from that sample, though applicants will also have a chance to ask clarifying questions.

The project of producing this dataset is open starting immediately, in a hybrid prize/grant format. We will pay $20,000 per run for the first 10 completed runs that meet our quality standard (as decided unilaterally by Eliezer Yudkowsky or his designates), and $1M total for the first batch of 100 runs beyond that.

If we think your attempt is sufficiently promising, we’re willing to cover your expenses (e.g., the costs of paying the authors) upfront, and we may also be willing to compensate you for your time upfront. You’re welcome to write individual runs manually, though note that we’re most enthusiastic about finding solutions that scale well, and then scaling them. More details on the payout process can be found below.

 

The Machine Learning Experiment

In slightly more detail, the plan is as follows (where the $1.2M prizes/budgets are for help with part 1, and part 2 is what we plan to subsequently do with the dataset):

 

1. Collect a dataset of 10, then ~100 thought-annotated dungeon runs (each run a self-contained story arc) of ~1,000 steps each, where each step contains the following (a minimal schema sketch appears after this list):

  • Thoughts (~250 words on average per step) are things the dungeon master was thinking when constructing the story, including:
    • Reasoning about the fictional world, such as summaries of what just happened and discussion of the consequences that are likely to follow (Watsonian reasoning), which are rendered in plain text in the above example;
    • Reasoning about the story itself, like where the plot tension lies, or what mysteries were just introduced, or what the player is likely wondering about (Doylist reasoning), which are rendered in (parentheses) in the above example;
    • New or refined information about the fictional world that is important to remember in the non-immediate future, such as important facts about a character, or records of important items that the protagonist has acquired, which are rendered in [square brackets] in the above example; and
    • Optionally: some examples of meta-cognition intended to, for example, represent a dungeon master noticing that the story has no obvious way forward or that their thoughts about where to go next have petered out, so they need to back up and rethink where the story is going, rendered in {braces}.
  • The prompt (~50 words on average) is the sort of story/description/prompt thingy that a dungeon master gives to the player, and can optionally also include a small number of attached thoughts where information about choices and updates to the world-state can be recorded.
  • The action (~2–20 words) is the sort of thing that a player gives in response to a prompt, and can optionally also include a thought if interpreting the action is not straightforward (especially if, e.g., the player describes themselves doing something impossible).
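
To make the shape of this data concrete, here is a minimal sketch of one way a step might be represented as structured data. The class names, field names, and thought-kind tags below are illustrative assumptions for clarity only; we are not prescribing a storage format, just the written format described above.

```python
# Illustrative sketch of a thought-annotated run as structured data.
# Names and tags are assumptions, not a required format.

from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ThoughtKind(Enum):
    WATSONIAN = "plain"      # reasoning inside the fictional world (plain text above)
    DOYLIST = "parens"       # reasoning about the story itself, rendered in (parentheses)
    WORLD_FACT = "brackets"  # durable facts to remember later, rendered in [square brackets]
    META = "braces"          # optional meta-cognition, rendered in {braces}


@dataclass
class Thought:
    kind: ThoughtKind
    text: str


@dataclass
class Step:
    """One (thought, prompt, thought, action, thought, outcome) step of a run."""
    thoughts: List[Thought]   # ~250 words on average across these thoughts
    prompt: str               # ~50 words of dungeon-master text shown to the player
    player_action: str        # ~2-20 words of player input
    outcome: str              # text describing what the action did in the world
    prompt_thoughts: List[Thought] = field(default_factory=list)  # optional, attached to the prompt
    action_thoughts: List[Thought] = field(default_factory=list)  # optional, e.g. for hard-to-interpret actions


@dataclass
class Run:
    steps: List[Step]         # a completed run is roughly 1,000 steps
```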

It’s unclear to us how much skill is required to produce this dataset. The authors likely need to be reasonably introspective about their own writing process, and willing to try things and make changes in response to initial feedback from the project leader and/or from MIRI.

A rough estimate is that a run of 1,000 steps is around 300k words of mostly thoughts, costing around 2 skilled author-months. (A dungeon run does not need to be published-novel-quality literature, only coherent in how the world responds to characters!) A guess as to the necessary dataset size is ~100 runs, for about 30M words and 20 author-years (though we may test first with fewer/shorter runs).

 

2. Retrain a large pretrained language model, like GPT-3 or T5 

A reasonable guess is that performance more like GPT-3 than GPT-2 (at least) is needed to really make use of the thought-intermediates, but in lieu of a large pretrained language model we could plausibly attempt to train our own smaller one.

Our own initial idea for the ML architecture would be to retrain one mode of the model to take (some suffix window of) the history units and predict thoughts, minimizing the log loss of the generated thought against the next thought in the run; and to retrain a second mode to take (some suffix window of) the history units plus one thought and produce a prompt, minimizing the log loss of the generated prompt against the next prompt in the run.
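
As a heavily simplified illustration of that idea, here is a sketch of what one training step for the two modes might look like with an off-the-shelf causal language model. Everything below — the use of GPT-2 as a stand-in, the mode-control tokens, and the way the suffix window is taken — is an assumption made for the sake of the sketch, not a description of the actual planned experiment.

```python
# Sketch of the two-mode fine-tuning objective, using a small pretrained causal LM
# as a stand-in for a GPT-3-scale model. Mode tokens and windowing are illustrative.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; something closer to GPT-3 is probably needed in practice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.train()

# Hypothetical control tokens selecting which "mode" we are training.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<PREDICT_THOUGHT>", "<PREDICT_PROMPT>"]}
)
model.resize_token_embeddings(len(tokenizer))


def mode_loss(history_text, control_token, target_text, max_context=1024):
    """Log loss of the target (a thought or a prompt) given a suffix window of the history."""
    target_ids = tokenizer.encode(" " + target_text + tokenizer.eos_token)
    context_ids = tokenizer.encode(history_text + control_token)
    context_ids = context_ids[-(max_context - len(target_ids)):]  # keep only a suffix window

    input_ids = torch.tensor([context_ids + target_ids])
    labels = torch.tensor([[-100] * len(context_ids) + target_ids])  # score only the target
    return model(input_ids=input_ids, labels=labels).loss


# One training step: thought mode conditions on the history alone; prompt mode
# conditions on the history plus the freshly produced thought.
history = "Just as you reach the dining hall, you hear a shout in the distance."
next_thought = "The shout is by somebody who just discovered a zombie."
next_prompt = "“That shout sounded like Morven, a first-year Fire mage,” says Lior."

loss = (
    mode_loss(history, "<PREDICT_THOUGHT>", next_thought)
    + mode_loss(history + " " + next_thought, "<PREDICT_PROMPT>", next_prompt)
)
loss.backward()  # an optimizer step over both modes would follow
```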

Imaginably, this could lead to the creation of dungeon runs that are qualitatively “more coherent” than those generated by existing methods. The primary goal, however, is for the thought-producing fragment of the system to give some qualitative access to the system’s internals: access that, e.g., allows an untrained observer to accurately predict the local developments of the story and occasionally answer questions about why things in the story happened, or that lets us intervene on the thoughts and get a different story in a controllable way if we don’t like how the story developed.

 

Motivation for this project

Many alignment proposals floating around in the community are based on AIs having human-interpretable thoughts in one form or another (e.g., in Hubinger’s survey article and in work by Christiano, by Olah, and by Leike). For example, this is implicit in the claim that humans will be able to inspect and understand the AI’s thought process well enough to detect early signs of deceptive behavior. Another class of alignment schemes is based on the AI’s thoughts being locally human-esque in some fashion that allows them to be trained against the thoughts of actual humans.

I (Nate) personally don’t have much hope in plans such as these, for a variety of reasons. However, that doesn’t stop Eliezer and me from wanting to rush ahead and start gathering empirical evidence about how possible it is in practice to get modern AI systems to factor their cognition through human-interpretable visible intermediates.

Modern AIs are notably good at crafting English text. Some are currently used to run dungeons (with modest success). If you look at where current AIs most excel at crafting artifacts, English paragraphs are among the artifacts they craft best and most impressively.

Furthermore, compared to many other things AIs have learned to do, running a responsive text dungeon is a task where it seems relatively feasible to ask an unusually introspective human author to write down their thoughts about how and why they would generate the next prompt from the user’s input.

So we are taking one of the outputs that current AIs seem to have learned best to design, and taking one of the places where human thoughts about how to design it seem most accessible, and trying to produce a dataset which the current or next generation of text predictors might be able to use to learn how to predict thoughts about designing their outputs and not just predict the outputs themselves.

This sort of interpretability is distinct from the sort of transparency work done in projects like Circuits (led by Chris Olah) — while Circuits tries to “open the black box” of machine learning systems by directly looking at what is happening inside of them, the project proposed here attempts the less ambitious task of having black-box models output interpretable intermediates that help explain their behavior (how such black-box models produce those intermediates internally is left unconstrained). The reason for our focus on this particular project of visible thoughts isn’t that we believe it to be better or more fruitful than Circuits-style transparency (we have said for years that Circuits-style research deserves all possible dollars that can be productively spent on it), but simply that it’s a different approach where it might also be possible to push progress forward.

Note that proponents of alignment strategies that involve human-esque thoughts (such as those linked above) do not necessarily endorse this particular experiment as testing any of their key uncertainties or confusions. We welcome suggested tweaks to the experiment (in the comments of the version of this announcement as it occurs on LessWrong) from any such proponents, to render it a better test of your ideas. (Though even if it doesn’t sate your own curiosity, we expect to learn some things ourselves.)

The main thing this project needs is a dataset, so MIRI is starting on producing that dataset. It’s plausible to us that GPT-3 will prove wholly unable to make use of this dataset; even if GPT-3 can’t, perhaps GPT-4 or some other future system will be able to.

There are additional more general reasons to work on this project. Specifically, it seems to me (Nate) and to Eliezer that capacity to execute projects such as this one is the current limiting bottleneck on MIRI. By pursuing this project, we attempt to resolve that bottleneck.

We hope, through this process, to build our capacity to execute on a variety of projects — perhaps by succeeding at the stated objective of building a dataset, or perhaps by learning about what we’re doing wrong and moving on to better methods of acquiring executive talent. I’ll say more about this goal in “Motivation for the public appeal” below.

 

Notes on Closure

I (Nate) find it plausible that there are capabilities advances to be had from training language models on thought-annotated dungeon runs. Locally these might look like increased coherence of the overall narrative arc, increased maintenance of local story tension, and increased consistency in the described world-state over the course of the run.  If successful, the idiom might generalize further; it would have to, in order to play a role in later alignment of AGI.

As a matter of policy, whenever a project like this has plausible capabilities implications, we think the correct response is to try doing it in-house and privately before doing it publicly — and, even then, only when the alignment benefits outweigh the plausible capability boosts. In this case, we tried to execute this project in a closed way in mid-2021, but work was not proceeding fast enough. Given that slowness, and in light of others publishing related explorations and results, and in light of the relatively modest plausible capability gains, we are moving on relatively quickly past the attempt to do this privately, and are now attempting to do it publicly.

 

Motivation for the public appeal

I (Nate) don’t know of any plan for achieving a stellar future that I believe has much hope worth speaking of. I consider this one of our key bottlenecks. Offering prizes for small projects such as these doesn’t address that bottleneck directly, and I don’t want to imply that any such projects are going to be world-saving in their own right.

That said, I think an important secondary bottleneck is finding people with a rare combination of executive/leadership/management skill plus a specific kind of vision. While we don’t have any plans that I’m particularly hopeful about, we do have a handful of plans that contain at least a shred of hope, and that I’m enthusiastic about pursuing — partly in pursuit of those shreds of hope, and partly to build the sort of capacity that would let us take advantage of a miracle if we get one.

The specific type of vision we’re looking for is the type that’s compatible with the project at hand. For starters, Eliezer has a handful of ideas that seem to me worth pursuing, but for all of them to be pursued, we need people who can not only lead those projects themselves, but who can understand the hope-containing heart of the idea with relatively little Eliezer-interaction, and develop a vision around it that retains the shred of hope and doesn’t require constant interaction and course-correction on our part. (This is, as far as I can tell, a version of the Hard Problem of finding good founders, but with an additional constraint of filtering for people who have affinity for a particular project, rather than people who have affinity for some project of their own devising.)

We are experimenting with offering healthy bounties in hopes of finding people who have both the leadership/executive capacity needed, and an affinity for some ideas that seem to us to hold a shred of hope.

If you’re good at this, we’re likely to make you an employment offer.

 

The Payouts

Our total prize budget for this program is $1.2M. We intend to use it to find a person who can build the dataset in a way that scales, presumably by finding and coordinating a pool of sufficiently introspective writers. We would compensate them generously, and we would hope to continue working with that person on future projects (though this is not a requirement in order to receive the payout).

We will pay $20k per run for the first 10 thought-annotated runs that we accept. We are willing to support applicants in producing these runs by providing them with resources up-front, including small salaries and budgets for hiring writers. The up-front costs a participant incurs will be deducted from their prizes, if they receive prizes. An additional $1M then goes to anyone among the applicants who demonstrates the ability to scale their run-creating process to produce 100 runs. Our intent is for participants to use some of that money to produce the 100 runs, and keep the remainder as a prize. If multiple participants demonstrate similar abilities to scale at similar quality-levels and similar times, the money may be split between them. We plan to report prize awards publicly.

In principle, all you need to do to get paid for thought-annotated dungeon runs is send us runs that we like. If your run is one of the first 10 runs, or if you’re the first to provide a batch of 100, you get the corresponding payment.

That said, whether or not we decide to pay for a run is entirely and unilaterally up to Eliezer Yudkowsky or his delegates, and will depend on whether the run hits a minimum quality bar. Also, we are willing to pay out from the $1M prize/budget upon becoming convinced that you can scale your process, which may occur before you produce a full 100 runs. We therefore strongly recommend getting in contact with us and proactively making sure that you’re on the right track, before sinking large amounts of time and energy into this project. Our senior research staff are willing to spend time on initial conversations and occasional check-ins. For more information on our support resources and how to access them, refer to the support and application sections below.

Note that we may tune or refine the bounty in response to feedback in the first week after this post goes live.

 

Support

We intend to offer various types of support for people attempting this project, including an initial conversation; occasional check-ins; office space; limited operational support; and certain types of funding.

We currently expect to have (a limited number of) slots for initial conversations and weekly check-ins, along with (a limited amount of) office space and desks in Berkeley, California for people working on this project. We are willing to pay expenses, and to give more general compensation, in proportion to how promising we think your attempts are.

If you’d like to take advantage of these resources, follow the application process described below.

 

Application

You do not need to have sent us an application in order to get payouts, in principle. We will pay for any satisfactory run sent our way. That said, if you would like any of the support listed above (and we strongly recommend at least one check-in to get a better understanding of what counts as success), complete the following process:

  • Describe the general idea of a thought-annotated dungeon run in your own words.
  • Write 2 (thought, prompt, thought, action, thought, outcome) sextuples you believe are good, 1 you think is borderline, and 1 you think is bad.
  • Provide your own commentary on this run.
  • Email all this to projects@intelligence.org.

If we think your application is sufficiently promising, we’ll schedule a 20-minute video call with some senior MIRI research staff and work from there.

Comments

This looks like it might be an existing collection of works annotated in a similar way: https://www.amazon.com/Star-Wars-Screenplays-Laurent-Bouzereau/dp/0345409817

I think this is an interesting project, and one that (from a very different angle) I’ve spent a bit of time on, so here are a few notes on that, followed by a few suggestions. Stella, in another comment, made several great points that I agree with and that are similar in spirit to my suggestions.

Anyway, based on a fairly similar motivation of wanting to be able to “ask a LM what it’s actually thinking/expecting”, combined with the general tendency to want to do the simplest and cheapest thing possible first… and then try to make it even simpler still before starting… we’ve experimented with including metadata in language pretraining data. Most large language datasets have this information, e.g. books have titles and (maybe) blurbs, websites have titles, URLs, and (maybe) associated subreddit links, etc. This data is obviously much noisier and lower quality than what you get from paying people for annotations, but it’s voluminous, diverse, and ~free.

When inserting this metadata for pretraining, we made sure to do so completely randomly, i.e. a book title might be inserted anywhere within a book (maybe several times in different context windows etc). We added separate <META_START> and <META_STOP> tokens to indicate the beginning and end of metadata, but that’s it. The motivation was to ensure that this “thought stream” was in-distribution at all positions within the context, while conversely making it easy to never sample it (by declining to sample the start token). This means that we can both use it when prompting, and use it as a query -- i.e. we can ask the model, at any time, “how likely is this to be from the NYTimes vs from 4Chan” by evaluating the logprobs of text enclosed by the tokens. With this specification, one can do a kind of “metadata beam search” where you prompt, sample, evaluate, cull, and repeat.
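
A rough illustrative sketch of that setup (not the commenter's actual code; the `score_continuation` and `tokenize` helpers referenced below are hypothetical stand-ins for model scoring and tokenization):

```python
# Illustrative sketch: splice metadata spans into a document at random offsets, and
# later score a candidate metadata string at some position by its log-probability.

import random

META_START, META_STOP = "<META_START>", "<META_STOP>"


def insert_metadata(doc_tokens, meta_tokens, n_insertions=1, rng=random):
    """Splice the metadata span, wrapped in the special tokens, at random offsets."""
    out = list(doc_tokens)
    for _ in range(n_insertions):
        i = rng.randrange(len(out) + 1)
        out[i:i] = [META_START, *meta_tokens, META_STOP]
    return out


def metadata_logprob(score_continuation, context_tokens, meta_tokens):
    """Use metadata as a query: how likely is this metadata span at this position?"""
    return score_continuation(context_tokens + [META_START], meta_tokens + [META_STOP])


# Hypothetical usage, comparing two candidate sources for the same passage:
#   lp_nyt   = metadata_logprob(lm_score, passage_tokens, tokenize("nytimes.com"))
#   lp_4chan = metadata_logprob(lm_score, passage_tokens, tokenize("4chan.org"))
#   # the larger log-probability indicates which source the model finds more plausible here
```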


We generally found that this sort of works, in that the mutual information between these labels and the text goes up with model size, and you can use these metadata tags as filters to get rid of some of the most irrelevant text. But the results weren’t immediately stunning, and so we didn’t investigate them much further (to be clear, this was mostly because we prioritized other things more highly, rather than because we don't view this as worthwhile).  

So my general suggestion would be to start off with something very cheap first, like the above. At the very least, this will mean that when you finetune on higher quality data, your format is already on-distribution. But hopefully it’ll also help you to calibrate expectations and give you a better sense for exactly what kind of data you want to shell out money for.

Beyond that, I agree with what Stella said -- it seems easier and better to focus first on shorter passages, both for human-sourcing reasons, and for diversity.  Typically the benefits we see from finetuning grow with something like the log of the dataset size, so a small number of shorter examples should quickly give you an idea of what kind of progress you can expect.

If it were me, I’d also try to increase RoI by asking people to add commentary to existing books, rather than having people write from scratch.  And I’d suggest making the formatting as simple and general as possible, both so that you can use and investigate it very flexibly, and to minimize regret if you change your mind in the future.

combined with the general tendency to want to do the simplest and cheapest thing possible first… and then try to make it even simpler still before starting… we’ve experimented with including metadata in language pretraining data. Most large language datasets have this information, e.g. books have titles and (maybe) blurbs, websites have titles, URLs, and (maybe) associated subreddit links, etc. This data is obviously much noisier and lower quality than what you get from paying people for annotations, but it’s voluminous, diverse, and ~free.

I'm sympathetic to the desire to keep things simple, but I actually think that getting good at scalably collecting rich human data is probably the most valuable part of the project. I'd be really excited to see Anthropic either building an excellent internal human data team, or figuring out how to work productively with one of the existing human data provider startups.

I am very excited about finding scalable ways to collect large volumes of high-quality data on weird, specific tasks. This seems very robustly useful for alignment, and not something we're currently that good at. I'm a bit less convinced that this task itself is particularly useful.

Have you reached out to e.g. https://www.surgehq.ai/ or another one of the companies that does human-data-generation-as-a-service?

Random small note - the 'dungeon' theme is slightly ...culturally offputting? or something for me, as someone who's never been into this kind of thing or played any of these and is therefore a bit confused about what exactly this involves, and has vague negative associations (I guess because dungeons sound unpleasant?). I wonder if something a bit blander like a story, play, or AI assistant setting could be better?

Someone who wants to claim the bounty could just buy the dataset from one of the companies that does this sort of thing, if they're able to produce a sufficiently high-quality version, I assume? Would that be in the spirit of the bounty?

We have now received the first partial run that meets our quality bar. The run was submitted by LessWrong user Vanilla_cabs. Vanilla's team is still expanding the run (and will probably fix some typos, etc. later), but I'm providing a copy of it here with Vanilla's permission, to give others an example of the kind of thing we're looking for:

https://docs.google.com/document/d/1Wsh8L--jtJ6y9ZB35mEbzVZ8lJN6UDd6oiF0_Bta8vM/edit

Vanilla's run is currently 266 steps long. Per the Visible Thoughts Project FAQ, we're willing to pay authors $20 / step for partial runs that meet our quality bar (up to at least the first 5,000 total steps we're sent), so the partial run here will receive $5320 from the prize pool (though the final version will presumably be much longer and receive more; we expect a completed run to be about 1000 steps).

Vanilla_cabs is open to doing paid consultation for anyone who's working on this project. So if you want feedback from someone who understands our quality bar and can demonstrably pass it, contact Vanilla_cabs via their LessWrong profile.

How do you think this project relates to Ought? Seems like the projects share a basic objective (having AI predict human thoughts had in the course of solving a task). Ought has more detailed proposals for how the thoughts are being used to solve the task (in terms of e.g. factoring a problem into smaller problems, so that the internal thoughts are a load-bearing part of the computation rather than an annotation that is predicted but not checked for being relevant).

So we are taking one of the outputs that current AIs seem to have learned best to design, and taking one of the places where human thoughts about how to design it seem most accessible, and trying to produce a dataset which the current or next generation of text predictors might be able to use to learn how to predict thoughts about designing their outputs and not just predict the outputs themselves.

As the proposal stands it seems like the AI's predictions of human thoughts would offer no relevant information about how the AI is predicting the non-thought story content, since the AI could be predicting these different pieces of content through unrelated mechanisms.

As the proposal stands it seems like the AI's predictions of human thoughts would offer no relevant information about how the AI is predicting the non-thought story content, since the AI could be predicting these different pieces of content through unrelated mechanisms.

Might depend whether the "thought" part comes before or after particular story text. If the "thought" comes after that story text, then it's generated conditional on that text, essentially a rationalization of that text from a hypothetical DM's point of view. If it comes before that story text, then the story is being generated conditional on it.

Personally I think I might go for a two-phase process. Do the task with a lot of transparent detail in phase 1. Summarize that detail and filter out infohazards in phase 2, but link from the summary to the detailed version so a human can check things as needed (flagging links to plausible infohazards). (I guess you could flag links to parts that seemed especially likely to be incorrigible/manipulative cognition, or parts of the summary that the summarizer was less confident in, as well.)

In case you missed it: we now have an FAQ for this project, last updated Jan. 7.

Came across this today on r/mlscaling and thought I'd put it here since it's relevant: https://arxiv.org/abs/2201.11903#google

This paper explores the ability of language models to generate a coherent chain of thought—a series of short sentences that mimic the reasoning process a person might have when responding to a question. Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks that otherwise have flat scaling curves.

It seems to me like this should be pretty easy to do and I'm disappointed there hasn't been more action on it yet. Things I'd try:
- reach out to various human-data-as-a-service companies like SurgeHQ, Scale, Samasource
- look for people on upwork 
- find people who write fiction on the internet (e.g. post on fanfiction forums) and offer to pay them to annotate their existing stories (not a dungeon run exactly, but I don't see why the dungeon setting is important)

I'd be interested to hear if anyone has tried these things and run into roadblocks.

I'm also interested if anyone has an explanation of why the focus is on the dungeon thing in particular rather than e.g. fiction generally.

One concern I'd have with this dataset is that the thoughts are post-hoc rationalizations for what is written rather than actually the thought process that went into it. To reduce this, you could do something like split it so one person writes the thoughts, and someone else writes the next step, without other communication.

It seems to me that the comments in code provide "visible thoughts" for what the programmer intends. What do you hope to learn from training language models on thought-annotated dungeons that you couldn't learn from language models that have already been trained on commented code?

Some naive thoughts in case useful:

A) Is the structured annotation format more useful than a gamemaster/writer thinking aloud while recording themselves (possibly with an audience)?

That could be the closest thing to a full transcript of the human process which downstream tasks could condense as needed. An adopted annotation format (prescribed or not) could potentially cause thoughts to be filtered, reinterpreted, or even steer human generation?

One key example against a fixed-format annotation, I think, is that human gamemasters and writers do not spend approximately constant effort per player action. They will do a lot of up-front work to have a plan for the story, can go on auto-pilot for many of the interactions, and will think hard about critical parts of the story. Language models which generate stories today notoriously seem to lack this red thread, and filling out a form summarizing the writers' thoughts may fail to capture this process.

The unstructured approach may also be closer to what pretrained models have learned and therefore require less data.

It could perhaps also provide a highly interesting dataset for another task relevant to the application - metareasoning in generation - should the agent output the next part of the story or keep thinking about the generation?

Alternatively, one could record all thoughts as they come, but follow up each output with some standardized questions - if there are some critical to the application?

B) I am curious whether sufficiently strong language models wouldn't be able to fake the explanations post-hoc.

At least, looking at the forms, I am not sure whether I could tell competent explanations apart. If that is the case, it could be that the dataset does not get us that far in interpretability, and instead leads to more specific needs. It might be worth trying to answer that question too.

E.g. before the dataset is made public, you could hide the thoughts in a crafted run and let another team fill in thoughts post-hoc. They could be rewarded for swaying evaluators to accept theirs as the original. This could also answer whether even humans are able to tell apart genuine motivations behind a decision vs made-up explanations; and provide another task dataset.

(C) Probably clear already, but models like GPT-3 can generate responses/stories while reflecting/talking to themselves, and some already use them this way and only output the end results. Although that is probably not operating at the desired level. Fine-tuning is also fairly cheap, so I don't think one has to settle for GPT-2. If the goal were interpretability of each generated token, perhaps the thoughts should also be derived from intermediate layers rather than being part of the sequence.)