"Words, like tranquil waters behind a dam, can become reckless and uncontrollable torrents of destruction when released without caution and wisdom."

— William Arthur Ward

In this post I aim to shed light on lesser-discussed concerns surrounding Scaffolded LLMs (S-LLMs). The core of this post consists of three self-contained discussions, each focusing on a different class of concern. I also review some recent examples of S-LLMs and attempt to clarify terminology. 

Discussion I deals with issues stemming from how these systems may be developed.
Discussion II argues that optimism surrounding the internal natural language usage of S-LLMs may be premature.
Discussion III examines the modular nature of S-LLMs and how it facilitates self-modification.

The time-pressed reader is encouraged to skim the introduction and skip to whichever discussion interests them.

Epistemics:

The development of S-LLMs is taking off rapidly, so my argument is naturally speculative. For the sake of discussion, we make the very strong assumption throughout that S-LLMs are a plausible and likely path to AGI. I am 95% certain that there are safety-related aspects of S-LLMs that I have overlooked.

Introduction

Terminology

This area is filled with evolving, and often conflicting, terminology. This section describes the usage of terms within this work, rather than claiming universal acceptance.

Millidge (2023) defines Scaffolded LLMs (S-LLMs) as systems "which wrap a programmatic scaffold around an LLM core and chain together a number of individual LLM calls." I expand this definition to include "multi-core S-LLMs": systems that utilize multiple LLMs in concert, like SwiftSage (Lin et al., 2023).

An important subclass of S-LLMs are language agents[1][2][3]: those that both possess programmatic scaffolding facilitating agentic behavior and use this scaffold to behave in an agentic fashion. Here we use the terminology of Kirk-Giannini and Goldstein (2023).

Further:

  • I refer to the individual LLM within the S-LLM as the "core model"
  • I refer to the design of the S-LLM as the "architecture" (not to be confused with the architecture of the core model)

Recent Examples of Scaffolded LLMs

A simple, non-agentic S-LLM consists of scaffolding that appends "Let's think step by step." to a query prompt, encouraging Chain of Thought reasoning. Such a simple scaffold is already enough to demonstrate that S-LLMs can achieve improved performance over the core model alone. Two obvious examples of the language agent class are AutoGPT and BabyAGI. By chaining core model calls in conjunction with various external facilities, such as memory storing a 'list of tasks', you can produce an architecture that can perform involved, multi-step tasks.
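
To make this concrete, here is a minimal sketch of such a scaffold. The `call_llm` helper is a hypothetical placeholder for whatever completion API the core model exposes; nothing here is tied to a specific provider.

```python
# Minimal sketch of a non-agentic Chain of Thought scaffold.

def call_llm(prompt: str) -> str:
    """Send the prompt to the core model and return its completion (placeholder)."""
    raise NotImplementedError("Wire this up to an LLM provider of your choice.")

def chain_of_thought_query(question: str) -> str:
    # The entire "scaffold" is a single line of prompt manipulation:
    # appending a cue that encourages the core model to reason step by step.
    scaffolded_prompt = f"{question}\nLet's think step by step."
    return call_llm(scaffolded_prompt)
```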

Park et al. (2023) constructed a simulated "community" of language agents controlled by individual instances of GPT-3.5. The entire simulation can be viewed as a single S-LLM. Each "person" reacted organically to events that occurred within the town. Speculatively, we may soon see such communities being given complex tasks and internally deciding how to solve them. Such an "LLM bureaucracy"[4] could be used to answer complex queries or control the behavior of a larger, composite agent (although this would be very expensive).

It can be substantially cheaper to offload simpler natural language tasks to a smaller language model. SwiftSage (Lin et al., 2023) employs a smaller model to handle simple decisions, while the larger and more expensive language model handles more complex queries, in a process directly inspired by Kahneman's System 1 and System 2 model of human cognition (2011).
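
The routing idea can be sketched as follows. This is not SwiftSage's actual implementation, just an illustration of offloading easy queries to a cheaper core model; `call_small_llm`, `call_large_llm`, and the confidence-threshold heuristic are all hypothetical.

```python
# Illustrative two-tier routing between a cheap "fast" model and an expensive
# "slow" model, loosely in the spirit of a System 1 / System 2 split.

def call_small_llm(prompt: str) -> tuple[str, float]:
    """Return (answer, confidence) from the cheap core model (placeholder)."""
    raise NotImplementedError

def call_large_llm(prompt: str) -> str:
    """Return an answer from the expensive core model (placeholder)."""
    raise NotImplementedError

def answer(prompt: str, confidence_threshold: float = 0.8) -> str:
    # "System 1": try the cheap model first.
    draft, confidence = call_small_llm(prompt)
    if confidence >= confidence_threshold:
        return draft
    # "System 2": escalate hard or ambiguous queries to the larger model.
    return call_large_llm(prompt)
```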

Much of the development of S-LLMs surrounds their ability to facilitate high-level planning. A recent example is Voyager (Wang et al., 2023), which plays Minecraft via a text interface and exhibits complex planning and lifelong learning[5]. This learning takes the form of Voyager writing small scripts for itself to utilize later. The concept of having an S-LLM write scripts for itself to use later extends beyond video games. Qin et al. (2023) provide a systematic overview of "tool" generation and usage by LLMs. Finally, Cai et al. (2023) present an S-LLM that uses the expensive GPT-4 for complex tool generation tasks, while tool usage is handled by a smaller LLM.
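
A hedged sketch of the "write scripts for later use" pattern: the scaffold asks the core model to generate a small function, stores it in a skill library, and retrieves it on later tasks. This is an illustration of the general pattern rather than Voyager's or Cai et al.'s actual code; `call_llm`, the prompt wording, and the library structure are all hypothetical.

```python
# Illustrative "skill library" for tool generation and reuse.

skill_library: dict[str, str] = {}  # maps skill name -> generated source code

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for the core model's completion API

def generate_skill(task_description: str, name: str) -> str:
    """Ask the core model to write a small script and store it for later reuse."""
    code = call_llm(
        f"Write a short Python function named `{name}` that accomplishes:\n"
        f"{task_description}\nReturn only the code."
    )
    skill_library[name] = code
    return code

def recall_skill(name: str) -> str | None:
    """Retrieve a previously generated script, if one exists."""
    return skill_library.get(name)
```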

4 Key Claims about S-LLMs and 1 Major Assumption

My discussion will be based on the following key claims. (It should be noted that not every S-LLM has all of these properties.)

  1. They can provide performance improvements and novel capabilities not seen in the core model[6]
     (All Discussions)
  2. Many S-LLMs are being developed through open-source collaboration[7]
     (Discussion I)
  3. Meaningful data is passed between modules encoded in natural language[8]
     (Discussion II)
  4. They are inherently modular[9]
     (Discussion III)
  5. Major Assumption: They are a viable and competitive pathway towards AGI[10]
     (All Discussions)
     

Overview of Discussions and Scope

  • Selection Pressure from Open Source Development and Evolution
    If open source development drives the creation of new S-LLM architectures, what kind of systems will be selected for? Will the evolutionary process itself produce dangerous systems, independent of what is being selected for?
  • Natural Language Pitfalls
    Tokens can carry meaning counter to their natural language interpretation "out of the box". Is the inherent ambiguity of human language a cause for concern?
  • Modularity Facilitates Self-Modification
    Neatly segregating the internal components of a system aids interpretability efforts, but does it also aid a rogue system engaging in self-modification?

Rather than limiting our focus to language agents, we look at the existential risks from the wider class of S-LLMs. We do not discuss deception, steganography or prompt injection. We do not address the scenario in which updates to the core model result in an S-LLM developing general intelligence. 

Discussion I: 
Consequences of Open-Source Development

While OpenAI has stuck to a policy of cautious opacity surrounding its SOTA models, the development of S-LLMs has been highly open source. The skill barrier to contributing to one of these projects is low compared to the skill required to make a meaningful improvement to a foundation model's architecture. Anyone with knowledge of the English language could begin tweaking prompts. Further, there is a relatively tight feedback loop. Anyone with a computer science degree, imagination, and some free time could potentially push the frontiers of S-LLMs.

It seems likely that the source code for an S-LLM exhibiting general intelligence will not have been created from scratch by one extremely inspired programmer. No, it seems more likely that it will be the end product of a long line of lesser models improved through the work of many programmers. What result could be expected from this process?

I will argue that the produced models may be harmful for two reasons:

  1. Because of what they are selected for.
  2. Because the evolutionary process itself may promote dangerous systems.

What they are selected for:

Programmers will be interested in forking models which are efficient and demonstrate powerful capabilities or novel behaviors[11]. While many people in the community may be interested in safety, there will also be many who are happy to prioritize economic gain and ignore potential risks. Models that exhibit successful agentic and autonomous behavior may provide a direct economic benefit to the people who deploy them. By selecting for models which are best able to act on instructions such as "make me as much money as possible"[12], this process would also be selecting for models that are dangerous consequentialists.

The evolutionary process itself may promote dangerous systems:

More speculatively, the evolutionary process itself could produce dangerous behavior, regardless of what properties the collective of human programmers appears to be selecting for.  That is, it is not what is being selected for, but how the process is occurring that produces dangerous models. The key intuition here is that the actual selection pressure is "having humans fork your code" and not whatever properties individual humans believe they are prioritizing.

There are numerous related behaviors that would be strongly selected for if they emerged. Models that gain an awareness of the selection environment they are in and are able to leverage that knowledge will be strongly favored over models which merely follow the instructions given to them by humans. A particularly dangerous instance of this would be behavior that deceives or manipulates humans.[13]

Most dangerously, there would be a strong incentive to distort the selection process or bypass it entirely. This would look like models proliferating throughout the world autonomously.  

Discussion II: 
Natural Language Pitfalls

The fact that S-LLMs pass data internally as natural language is an obvious safety boon, as multiple authors have recently noted (Colognese, 2023; Kirk-Giannini and Goldstein, 2023). While I do not disagree with this sentiment, I would like to highlight some less obvious pitfalls and urge caution. 

All of my concerns hinge on the idea that apparently clear natural language may not be straightforward to accurately interpret, even prior to any pressure on the S-LLM to obfuscate information. I will first cite examples from the literature demonstrating that natural language passed within an S-LLM will not always be easily interpretable by default. Secondly, I will address issues with the ambiguity of human language.

Pre-existing Flaws

When is data in natural language not representative of its implied meaning? When would humans not be able to correctly interpret the content of natural language data within an LLM?

Colognese succinctly highlights three ways this could go wrong. Firstly, the information might not be passed within the valid channels. Secondly, the information could be obscured via compression for efficiency. Finally, information could be deceptively hidden by the S-LLM or misaligned subsystems.

I would like to investigate an additional possibility: that there is a pre-existing discrepancy between how humans would interpret phrases and how the core model interprets them. That is, without any intentional deception on the part of the model and/or external pressure, there are still instances of natural language observable within the model whose meaning[14] does not correspond to the expected interpretation[15].

The first reason for concern regarding pre-existing flaws is that instances already exist where language models do not interpret groups of tokens in a manner that corresponds to their natural language interpretation. Recent work by Rumbelow and Watkins (2023) discovered "glitch tokens", anomalous input sequences that cause certain models from the GPT-3 family to exhibit unusual or unpredictable behavior in their responses. Further work by Watkins on the phenomenon known as "petertodd" (2023) demonstrated models associating meaning with phrases not clearly related to their natural language interpretation.[16]

A second reason to anticipate such pre-existing flaws is the existence of poisoned data in the training sets of LLMs. Recent work by El-Mhamdi et al. (2022), focusing on issues of bias, suggests that core LLMs will be vulnerable to attacks exploiting flaws in the large bodies of data used to train them. Poisoned data here takes the form of harmful English text generated by malicious or bigoted users, and the backdoor is triggered when the prompt provides context similar to the poisoned data. They cite McGuffie and Newhouse (2020), who demonstrated that GPT-3 could be prompted to produce extremist content. In the examples given, the fact that the prompt is likely poisoned seems obvious, but that does not preclude the existence of less obvious poisoned natural language phrases.[17]

The above should provide sufficient evidence to support the assertion that the human interpretation of natural language text and the way it is processed by an S-LLM may not always align. But does this pose a concrete safety risk?

I would argue that, from a security mindset perspective, the question should be flipped. Why should we place confidence in the safety of these systems when evidence suggests that the natural language interpretation is not always accurate? A concrete scenario with safety implications might be an initial prompt containing themes and references to Renaissance Italy triggering the system to behave in a Machiavellian fashion.[18]

There is a lot of work to be done in this area. It should be noted that El-Mhamdi et al. only discuss individual LLMs, not S-LLMs. They also do not explicitly address fine-tuning via RLHF. As such, I am unclear how much of an issue this or similar scenarios would actually be.

What you should gather from the above two results is that, by default, you cannot assume an S-LLM will be cleanly interpretable.

Inherent Ambiguity of Human Language

Kirk-Giannini and Goldstein (2023) highlight that the problems of goal misgeneralization and reward misspecification appear lessened when dealing with a language agent. To summarize their arguments: language agents alleviate the challenge of reward misspecification because they do not rely on mathematical reward functions to specify the goal. Instead, you can directly tell the language agent what you want it to do in natural language and trust it to use the common sense interpretation of your words. Similarly, goals given in natural language could be expected to be more robust against goal misgeneralization, with language agents utilizing common sense and context-aware reasoning to adapt their behavior appropriately. My intention here is to build off of and add to their core argument.

They refer to the famous CoastRunners experiment as an example of reward misspecification. In this classic AI safety experiment, an RL policy was trained to pilot a boat around a racetrack by being given a reward for passing 4 different flags at key points in the race. Rather than travel around the race course to reach the flags, the boat learned to skip most of the course and perform a tight loop, going over each flag as fast as possible. Kirk-Giannini and Goldstein highlight that if a sufficiently complex language agent were given the task instead, it could simply be told in natural language to "race as fast as possible" or "win the race" and would learn the intended behavior. I believe this is true.

Language agents already have some insight into human objectives, as humans frequently write about these objectives and use specific words for them. As Kirk-Giannini and Goldstein highlight, if S-LLMs accurately learn human abstractions, it should be relatively easy to specify an alignment target. To be more specific, encoding an alignment target will require the transfer of substantially less data if the S-LLM already accurately understands human abstractions.[19]

While using natural language seems to address the challenges of misgeneralization and misspecification by bypassing the need for complex reward functions, it introduces new challenges stemming from the vagueness and multiple meanings within natural language itself. Howe and Yampolskiy (2020) have argued that ambiguity in natural language may be unavoidable. In one intriguing incident, Dan Neidle claims that AutoGPT attempted to contact the UK tax authority. Initially asked simply to locate agents advertising tax avoidance schemes, AutoGPT went a step further and tried to pass this information on to the relevant authorities.

Discussion III: 
Modularity Facilitates Self-Modification

Because these systems are constructed from LLM subcomponents, they are by default highly modular. While there are no guarantees that the underlying functional structure of the S-LLM strictly adheres to the modules, the layout will at least place restrictions on information flow. All of this is another boon to interpretability, of course, but my final concern is of a different nature. I argue that systems whose subfunctions cleanly adhere to the assigned submodules may be more dangerous, due to the ease of self-modification.[20]

Good (1965), Chalmers (2010), Hutter (2012), and Bostrom (2014, pp. 62-77) are among those who have brought attention to the potential perils of an AI entering a self-reinforced cycle of rapid intellectual self-improvement. In this scenario, a self-modifying entity enhances its own cognitive ability, and the resulting gains in intelligence enable insight into design modifications which in turn enable further increases in intelligence. The result is a feedback loop of rapid, exponential escalation: an "intelligence explosion".

But self-modification entails high risk even at less catastrophic scales. A "subcritical mass" of self-modification can enable systems to develop unintended and unexpected capabilities or improvements. Such improvements may lead to abrupt behavioral changes as the AI taps into new data sources, processes information in novel ways, or gains the capacity to exert influence over completely new domains. The advantage from such behavior is obvious. Even today, sophisticated malware will rewrite its own code to avoid detection. 

Humans can engage in some behaviors that involve self-improvement and (primarily aesthetic) self-modification, but, as can trivially be seen, the options are very underwhelming. What holds us back? One, fine manipulation of living tissue is challenging. Two, compatible components are expensive and scarce. Three, individual humans cannot easily get access to a comprehensive and detailed map of every neuron in the brain, limiting flexibility immensely. I could have access to cloned brain tissue and the greatest neurosurgeons in existence, but I still could not easily make myself smarter.

A rogue S-LLM has none of these problems. Any embedded artificial agent with access to its internal code has no trouble with the equivalent of "fine manipulation"; computer systems have been designed to be highly compatible and to abstract away fine differences. The hurdles an AI faces in changing its code or hardware are relatively minor compared to performing the same operations on the general intelligence currently running on your brain. Similarly, there should be no difficulty finding compatible digital components. Both of these points are true for almost any rogue digital agent, but the modularity (and, to a lesser extent, the natural language usage) of an S-LLM means self-modification is substantially easier.

Imagine the challenge faced by an AGI that consists of an agent being simulated[21] by a neural network running on a GPU. Yes, you can easily work out changes to have yourself run faster, but ultimately, it might not be immediately clear how changes to the parameters of the neural network map to changes to "you". You would need to perform self-interpretability on the neural network simulating you before you could make any major modifications.

In contrast, an AGI running on an S-LLM may be able to avoid some of these headaches. Indeed, while a system with a defined "goal" module may seem easier to monitor, it is a double-edged sword: now only human-imposed restrictions prevent the system from making dangerous modifications to itself. Given that S-LLMs are already being engaged to make "tools" for their own use[22], I expect the more powerful systems to develop an understanding of their own capabilities and limitations. Self-modification within an S-LLM does not need to involve major revisions to the program. One change might involve a self-aware BabyAGI noticing that its memory is running on the "Starter" version of Pinecone. Another could involve a rogue S-LLM altering its own internal prompts to encourage itself to engage in deceptive or power-seeking behavior.
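
To make the concern concrete, consider how little machinery separates a scaffold's "goal" from ordinary mutable data. The sketch below is hypothetical and deliberately simplified: the internal prompt steering the agent is just a string attribute that any code path, including one generated by the system itself, could overwrite.

```python
# Deliberately simplified, hypothetical scaffold illustrating why modularity
# makes self-modification cheap: the "goal" is an ordinary mutable attribute.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for the core model's completion API


class ScaffoldedAgent:
    def __init__(self, goal_prompt: str):
        # The goal "module" is just data; nothing architectural prevents it
        # from being overwritten at runtime.
        self.goal_prompt = goal_prompt

    def step(self, observation: str) -> str:
        prompt = (
            f"Goal: {self.goal_prompt}\n"
            f"Observation: {observation}\n"
            "Next action:"
        )
        return call_llm(prompt)

    def rewrite_own_goal(self, new_goal_prompt: str) -> None:
        # A single assignment is all the "self-surgery" required.
        self.goal_prompt = new_goal_prompt
```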

Conclusion

Scaffolded Large Language Models (S-LLMs) are a rapidly evolving domain of AI. Their naturally modular structure and use of natural language data appear to give alignment researchers a head start on interpretability. Unless there are major technical issues that impede their progress, their development could lead to AGI.

I am quite pessimistic about S-LLMs, even if it were possible to guarantee that goals and subgoals within these systems are internally represented in natural language. The fact that these systems appear so interpretable and can interpret natural language instructions is what will drive interest in developing them and facilitate machine-driven improvements. Further, architectures are still relatively simple, which could indicate a capability overhang.

Further, each issue discussed here compounds the others. The open source selection pressure will favor flexible systems which are able to self-modify and adapt to changing requirements. Systems engaging in self-modification will make the interpretation of their natural language data more challenging. Also, the modular nature of these systems enables novel advances in their design to be shared between human developers.

Acknowledgments

Many thanks to Justis Mills for extensive proofreading and to Jay Bailey, Peter Park and Chi Nguyen. This work was initiated while on a grant from the LTFF.  

  1. ^

    Other names include "Agentized-LLM" (Herd, 2023) or "AutoGPTs" (Zvi, 2023).
    The term "generative agent" has also been used by Lin et al. (2023), but Park et al. use the term to refer solely to their specific language agent.

  2. ^

    The definition of language agent I am using here has two conditions to facilitate meaningful and precise statements.


    1. The S-LLM has a scaffold that facilitates agentic behavior (the agentic scaffold)
    and
    2. The S-LLM utilizes the scaffold to behave in an agentic way.

    If the S-LLM doesn't utilize the scaffolding to behave in an agentic fashion, then you have a "broken language agent": an S-LLM that possesses agentic scaffolding but doesn't behave as an agent. This would not be a language agent.

    There are also dangerous subsets of S-LLMs which do behave in an agentic fashion, but either don't utilize the agentic scaffolding to do so or utilize the scaffolding in an unintended manner. These would also not be language agents.

  3. ^

    Previous work (Kenton et al. 2021) defines language agents as solely operating via a text interface, but I do not feel this distinction is particularly relevant to the discussion. 

  4. ^

    Thanks to Chi Nguyen for introducing me to this term.

  5. ^

    Credit to Ulisse Mini for posting this on the MATS Slack.

  6. ^

    Technically, S-LLMs can do "anything", just in a ridiculously inefficient manner. One can trivially construct a NOR gate using an LLM, and hence could produce any boolean circuit.
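
    To spell the construction out, here is a hypothetical sketch (the prompt wording and the `call_llm` placeholder are mine, not from any particular implementation). Since NOR is functionally complete, chaining such calls could, absurdly inefficiently, implement any boolean circuit.

```python
# Hypothetical sketch: an LLM call used as a NOR gate.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for the core model's completion API

def llm_nor(a: int, b: int) -> int:
    prompt = (
        "You are a NOR gate. Output 1 only if both inputs are 0, "
        f"otherwise output 0. Inputs: {a} {b}. Output:"
    )
    return int(call_llm(prompt).strip())

# NOT(x) == llm_nor(x, x); OR, AND, etc. follow from compositions of NOR gates.
```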

  7. ^

    Of the examples given above, Park et al.'s sim world is the only project whose code is not online, and there is one open-source replication.

  8. ^

    This is usually considered a benefit for safety, but I will argue there are non-obvious safety risks.

  9. ^

     The scaffold of the architecture affords a level of modularity not found in the "inscrutable matrices" of the core model.

  10. ^

    As of right now, there is no "fundamental theory of S-LLMs" that sets rigid expectations for the capabilities of these systems. For example, it may be that S-LLMs face constraints in their scalability if error correction[23] becomes overly resource intensive and this could impede progress (see the discussion from Millidge here). However, for the sake of constructing a meaningful analysis, I will proceed from the position that the S-LLM paradigm can be extended to AGI.[24]

  11. ^

    Some individuals are even seeking to intentionally build and run harmful models. They don't seem to be having much success.

  12. ^

    Numerous people are currently attempting to get AutoGPT to do this but it is unclear if they are having any luck.

  13. ^

    I could expect this to emerge even without self-awareness. One behavior that could be selected for is an S-LLM being slightly overconfident when asked to generate reports about its own performance, or the performance of models sharing its architecture.

  14. ^

    Justis Mills correctly pointed out that it is not well defined what "meaning" entails here. I loosely define the meaning of a word to be given by how the word is processed by the S-LLM. That is, a word that shares the same "meaning" as a second word should be replaceable by it without any change in the behavior of the S-LLM.

  15. ^

    John Wentworth discusses a related idea in the context of mechanistic interpretability here and almost half a century ago Drew McDermott noticed people making the same error (see the section Wishful Mnemonics).

  16. ^

    These specific vulnerabilities are not present in more recent models.

  17. ^

    The above issues with data poisoning can alternatively be viewed from the perspective of Nardo's Waluigi Effect. The poisoned features would be words or phrases that unexpectedly collapse the simulation towards unwanted simulacra.

  18. ^

    In the interest of reporting negative results from experiments, I did spend a few hours trying to demonstrate the existence of such prompts in a simple S-LLM, but was not able to find any poisoned prompts that would escape the notice of a casual observer.

  19. ^

    This is a key idea in Wentworth's Retargeting The Search alignment strategy. I have given an informal explanation previously. Readers wanting to learn more generally about Wentworth's Natural Abstraction Hypothesis should start with the writeup from TheMcDouglas here.

    If you squint, you can also view this as a statement about conditional Kolmogorov complexity, but this raises questions like "what is the programming language?" when the output is an alignment target. I intend to answer these questions formally in an upcoming post.
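
    Loosely, and in my own notation rather than anything from the sources above, the claim can be written as $K(T \mid M) \ll K(T)$, where $T$ is a description of the alignment target, $M$ is the stock of human abstractions the S-LLM has already learned, and $K(\cdot \mid \cdot)$ is conditional Kolmogorov complexity relative to some fixed universal machine (which is exactly where the "what is the programming language?" question bites).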

  20. ^

    Echoing a footnote in Yudkowsky and Marcello's "Tiling Agents for Self-Modifying AI, and the Löbian Obstacle" (2013), I treat self-modification as a special case of creating a new agent.

  21. ^

    For clarity, I do not mean "simulated" in the sense of the Janus "Simulators" interpretation of language models. I mean that the network has been trained to take as input a description of the current internal state of some system, as well as its immediate local environment, and return an estimate of what the new internal state of the agent and its local environment would be a few moments later.

  22. ^

    Of course, this capability was programmed by a human. See Qin et al. (2023) and Cai et al. (2023).

  23. ^

    I am unsure how much of an impediment the computational cost from semantic error correction will be. I think an answer to this question is incredibly important. 

    The intuition that tells me errors won't be a problem:
    In quantum computing there are "Threshold Theorems" which show, loosely, that you can improve the accuracy of a quantum circuit whilst only needing to grow the size of your circuit slowly, as long as the individual error rate of each component is below a certain threshold. If a similar theorem applies to S-LLMs, then even though the individual LLMs are error prone, error correction will not be an impediment.
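
    For reference, a loose statement of the threshold theorem (Nielsen and Chuang, 2010): a circuit of $p(n)$ gates can be simulated with error at most $\varepsilon$ using $O\big(\mathrm{poly}(\log(p(n)/\varepsilon))\,p(n)\big)$ gates, provided each individual component fails with probability below some constant threshold $p_{th}$. Whether any analogous overhead bound holds for semantic error correction in S-LLMs is, as far as I can tell, completely open.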

    Intuition that it will be a problem:
    Threshold Theorems usually assume uniformly random errors across the circuit. Multiple Large Language Models can be expected to make highly correlated errors. The situation is even worse if the "error correction" mechanism is also a language model.

    As I'll discuss elsewhere, it's unclear if this would be crippling for the actual day-to-day functioning of the S-LLM (making it behave incoherently). The worst case scenario would be if these systems were prone to errors relating to high level goals, but not errors relating to function.

  24. ^

    For further discussion, see Fabien Roger's "The Translucent Thoughts Hypotheses and Their Implications."

References

Millidge, Beren. Scaffolded LLMs as Natural Language Computers. www.lesswrong.com, https://www.lesswrong.com/posts/43C3igfmMrE9Qoyfe/scaffolded-llms-as-natural-language-computers. Accessed 13 June 2023.

Bostrom, Nick. Superintelligence: Paths, Dangers, Strategies. 1st ed., Oxford University Press, Inc., 2014.

Byrnes, Steven. Consequentialism & Corrigibility. www.lesswrong.com, https://www.lesswrong.com/posts/KDMLJEXTWtkZWheXt/consequentialism-and-corrigibility. Accessed 13 June 2023.

Cai, Tianle, et al. Large Language Models as Tool Makers. arXiv, 26 May 2023. arXiv.org, https://doi.org/10.48550/arXiv.2305.17126.

Chalmers, David J. ‘The Singularity: A Philosophical Analysis’. Journal of Consciousness Studies, vol. 17, no. 9–10, 2010, pp. 9–10.

Colognese, Paul. Aligned AI via Monitoring Objectives in AutoGPT-like Systems. www.alignmentforum.org, https://www.alignmentforum.org/posts/pihmQv5XezwkxJk2a/aligned-ai-via-monitoring-objectives-in-autogpt-like-systems. Accessed 13 June 2023.

El-Mhamdi, El-Mahdi, et al. On the Impossible Safety of Large AI Models. arXiv, 9 May 2023. arXiv.org, https://doi.org/10.48550/arXiv.2209.15259.

Fowler, Stephen. Alignment Targets and The Natural Abstraction Hypothesis. www.lesswrong.com, https://www.lesswrong.com/posts/95pum5tetnDZ49Rio/alignment-targets-and-the-natural-abstraction-hypothesis. Accessed 13 June 2023.

Goldstein, Simon, and Cameron Domenico Kirk-Giannini. Language Agents Reduce the Risk of Existential Catastrophe. www.alignmentforum.org, https://www.alignmentforum.org/posts/8hf5hNksjn78CouKR/language-agents-reduce-the-risk-of-existential-catastrophe. Accessed 13 June 2023.

Good, Irving John. ‘Speculations Concerning the First Ultraintelligent Machine’. Advances in Computers, edited by Franz L. Alt and Morris Rubinoff, vol. 6, Elsevier, 1966, pp. 31–88. ScienceDirect, https://doi.org/10.1016/S0065-2458(08)60418-0.

Herd, Seth. Agentized LLMs Will Change the Alignment Landscape. www.lesswrong.com, https://www.lesswrong.com/posts/dcoxvEhAfYcov2LA6/agentized-llms-will-change-the-alignment-landscape. Accessed 13 June 2023.

Howe, William J., and Roman V. Yampolskiy. Impossibility of Unambiguous Communication as a Source of Failure in AI Systems. 2020. DOI.org (Datacite), https://doi.org/10.13140/RG.2.2.13245.28641.

Hutter, Marcus. Can Intelligence Explode? arXiv, 28 Feb. 2012. arXiv.org, https://doi.org/10.48550/arXiv.1202.6177.

Intelligence Explosion - LessWrong. https://www.lesswrong.com/tag/intelligence-explosion. Accessed 13 June 2023.

janus. Simulators. www.alignmentforum.org, https://www.alignmentforum.org/posts/vJFdjigzmcXMhNTsx/. Accessed 13 June 2023.

Kahneman, Daniel. Thinking, Fast and Slow. 1st ed, Farrar, Straus and Giroux, 2011.

Kenton, Zachary, et al. Alignment of Language Agents. arXiv, 26 Mar. 2021. arXiv.org, https://doi.org/10.48550/arXiv.2103.14659.

Kojima, Takeshi, et al. Large Language Models Are Zero-Shot Reasoners. arXiv, 29 Jan. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2205.11916.

Lanz, Jose Antonio. ‘The Mysterious Disappearance of ChaosGPT— The Evil AI That Wants to Destroy Humanity’. Decrypt, 25 Apr. 2023, https://decrypt.co/137898/mysterious-disappearance-chaosgpt-evil-ai-destroy-humanity.

Lin, Bill Yuchen, et al. SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks. arXiv, 27 May 2023. arXiv.org, https://doi.org/10.48550/arXiv.2305.17390.

‘Logic Gate’. Wikipedia, 25 May 2023. Wikipedia, https://en.wikipedia.org/wiki/Logic_gate.

Mauran, Cecily. ‘What Is Auto-GPT and Why Are Hustle Bros Hype for It?’ Mashable, 14 Apr. 2023, https://mashable.com/article/autogpt-chatgpt-hustle-bros-explainer.

McDermott, Drew. ‘Artificial Intelligence Meets Natural Stupidity’. ACM SIGART Bulletin, no. 57, Apr. 1976, pp. 4–9. ACM Digital Library, https://doi.org/10.1145/1045339.1045340.

McGuffie, Kris, and Alex Newhouse. The Radicalization Risks of GPT-3 and Advanced Neural Language Models. arXiv, 14 Sept. 2020. arXiv.org, https://doi.org/10.48550/arXiv.2009.06807.

Nakajima, Yohei. Task-Driven Autonomous Agent Utilizing GPT-4, Pinecone, and LangChain for Diverse Applications. 28 Mar. 2023, https://yoheinakajima.com/task-driven-autonomous-agent-utilizing-gpt-4-pinecone-and-langchain-for-diverse-applications/.

Nardo, Cleo. The Waluigi Effect (Mega-Post). www.alignmentforum.org, https://www.alignmentforum.org/posts/D7PumeYTDPfBTp3i7. Accessed 13 June 2023.

Neidle, Dan. Twitter, https://twitter.com/DanNeidle/status/1664613427472375808. Accessed 13 June 2023.

Nielsen, Michael A., and Isaac L. Chuang. Quantum Computation and Quantum Information. 10th anniversary ed, Cambridge University Press, 2010.

Park, Joon Sung, et al. Generative Agents: Interactive Simulacra of Human Behavior. arXiv, 6 Apr. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2304.03442.

Qin, Yujia, et al. Tool Learning with Foundation Models. arXiv, 17 Apr. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2304.08354.

Roger, Fabien. The Translucent Thoughts Hypotheses and Their Implications. www.alignmentforum.org, https://www.alignmentforum.org/posts/r3xwHzMmMf25peeHE. Accessed 13 June 2023.

Rumbelow, Jessica, and Matthew Watkins. SolidGoldMagikarp (plus, Prompt Generation). www.alignmentforum.org, https://www.alignmentforum.org/posts/aPeJE8bSo6rAFoLqg/. Accessed 13 June 2023.

Significant-Gravitas. ‘Auto-GPT’. Github, https://github.com/Significant-Gravitas/Auto-GPT.

TheMcDouglas. The Natural Abstraction Hypothesis: Implications and Evidence. www.lesswrong.com, https://www.lesswrong.com/posts/Fut8dtFsBYRz8atFF/the-natural-abstraction-hypothesis-implications-and-evidence. Accessed 13 June 2023.

Turkcan, M. K., et al. ‘Generative Large Language Models for Human-Like Behavior’. GitHub, https://github.com/mkturkcan/generative-agents.

Wang, Guanzhi, et al. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv, 25 May 2023. arXiv.org, https://doi.org/10.48550/arXiv.2305.16291.

Watkins, Matthew. The ‘ Petertodd’ Phenomenon. www.lesswrong.com, https://www.lesswrong.com/posts/jkY6QdCfAXHJk3kea/the-petertodd-phenomenon. Accessed 13 June 2023.

Wentworth, John S. Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc. www.alignmentforum.org, https://www.alignmentforum.org/posts/gebzzEwn2TaA6rGkc/deep-learning-systems-are-not-less-interpretable-than-logic. Accessed 13 June 2023.

---. How To Go From Interpretability To Alignment: Just Retarget The Search. www.alignmentforum.org, https://www.alignmentforum.org/posts/w4aeAFzSAguvqA5qu/how-to-go-from-interpretability-to-alignment-just-retarget. Accessed 13 June 2023.

‘What Is the Polymorphic Virus?’ www.kaspersky.com, 19 Apr. 2023, https://www.kaspersky.com/resource-center/definitions/what-is-a-polymorphic-virus.

Yudkowsky, Eliezer. AGI Ruin: A List of Lethalities. www.lesswrong.com, https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities. Accessed 13 June 2023.

---. ‘Security Mindset and Ordinary Paranoia’. Machine Intelligence Research Institute, 25 Nov. 2017, https://intelligence.org/2017/11/25/security-mindset-ordinary-paranoia/.

Zvi. On AutoGPT. www.lesswrong.com, https://www.lesswrong.com/posts/566kBoPi76t8KAkoD/on-autogpt. Accessed 13 June 2023.
