How does an LLM perform actions?

You, Fri May 10 2024 • ai llm gpt

TL;DR: The LLM decides which actions to take and outputs text instructions. ChatGPT parses those instructions and executes the actions using specialized tools.

How does an LLM-based system like ChatGPT perform actions such as generating images or searching the internet?

Let's run through an analogy.

Imagine an LLM as a highly intelligent printer. This PrinterLLM operates by receiving your instructions on a piece of paper (your prompt) and then printing out another piece of paper with the response.

Despite its limitation to paper communication, PrinterLLM is remarkably versatile:

Give it a list of ingredients, say tomatoes, pasta, and cheese, and it'll print out a delicious recipe for you.
Show it a menu, and it can suggest improvements, like adding a special dish or highlighting popular items.

Moreover, PrinterLLM doesn't start from scratch for each task. It accepts input only from a special sheet of paper that always describes its capabilities and personality at the top, with the user's specific request underneath.
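To make the "special sheet of paper" concrete, here is a minimal sketch of how such a page could be assembled. The names `build_prompt` and `SYSTEM_SHEET`, the separator line, and the wording of the instructions are all made up for illustration; real systems structure this differently.

```python
# Hypothetical fixed header: capabilities and personality, always at the top.
SYSTEM_SHEET = "You are PrinterLLM, a helpful kitchen assistant. You can only reply in text."


def build_prompt(system_sheet: str, user_request: str) -> str:
    """Put the fixed instructions at the top and the user's request underneath."""
    return f"{system_sheet.strip()}\n\n---\n\n{user_request.strip()}"


# The full page that would be handed to the model for one request.
page = build_prompt(SYSTEM_SHEET, "Ingredients: tomatoes, pasta, cheese. Suggest a recipe.")
print(page)
```

The user only ever writes the bottom half of the page; the top half is added for them on every request.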

But how does PrinterLLM perform actions?

Well, it doesn't. Remember, an LLM can only print text. But what it can do is be our intelligent decision maker!

We can give PrinterLLM a piece of paper describing a situation and ask it to decide what to do next. If we describe a situation where the dishes need washing, PrinterLLM might print, "It's time to wash the dishes." And then it's up to us to act on this advice.

ChatGPT functions similarly to PrinterLLM. It takes your text input, passes it to an LLM along with internal instructions and personality rules, and then outputs the LLM's textual response.

When you ask ChatGPT to perform an action, such as finding the latest news about Paris Fashion Week, it adds its hidden instructions to your prompt and sends both to an LLM. The LLM then decides that an action should be taken and generates text describing that action, like "action:browse_web, prompt=latest news on Paris Fashion Week."

ChatGPT reads this text and recognizes that it is not a typical response but a description of an action. It parses the text to get the action name and the prompt, which tell it to forward the prompt to a web search tool like Bing. When Bing comes back with the results, ChatGPT passes them back to the LLM and asks it to format them into a conversational, user-friendly answer.
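The round trip described above can be sketched in a few lines. This is only an illustration under loud assumptions: the "action:..., prompt=..." text format is this article's simplified example rather than the format ChatGPT really uses, and `fake_llm`, `fake_web_search`, `TOOLS`, `parse_action`, and `handle_request` are hypothetical stand-ins, not real APIs.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern for the article's illustrative format:
# "action:browse_web, prompt=latest news on Paris Fashion Week"
ACTION_RE = re.compile(r"^action:(?P<name>\w+),\s*prompt=(?P<prompt>.+)$", re.DOTALL)


def parse_action(llm_output: str) -> Optional[tuple[str, str]]:
    """Return (action_name, prompt) if the text describes an action, else None."""
    match = ACTION_RE.match(llm_output.strip())
    if match is None:
        return None
    return match.group("name"), match.group("prompt").strip()


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: requests a search first, then summarizes results."""
    if "Search results:" in prompt:
        return "Here is what I found: " + prompt.split("Search results:", 1)[1].strip()
    return "action:browse_web, prompt=latest news on Paris Fashion Week"


def fake_web_search(query: str) -> str:
    """Stand-in for a search tool; a real one would call a search engine."""
    return f"[stub result for '{query}']"


# Registry mapping action names to the tools that carry them out.
TOOLS: dict[str, Callable[[str], str]] = {"browse_web": fake_web_search}


def handle_request(user_request: str) -> str:
    """Ask the model, run the tool it requests (if any), then ask it to format the results."""
    first_reply = fake_llm(user_request)
    action = parse_action(first_reply)
    if action is None:
        return first_reply  # ordinary text answer, nothing to execute
    name, tool_prompt = action
    tool = TOOLS.get(name)
    if tool is None:
        return f"Unknown action requested: {name}"
    results = tool(tool_prompt)
    # Second pass: hand the tool output back to the model for a friendly answer.
    return fake_llm(f"{user_request}\n\nSearch results: {results}")


answer = handle_request("Find the latest news about Paris Fashion Week")
print(answer)
```

Notice that the model function only ever returns strings. Everything that touches the outside world happens in `handle_request`, which is the part playing ChatGPT's role in this sketch.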

The LLM doesn't perform actions itself; it requests them through textual commands, and the surrounding system carries them out.

While ChatGPT might seem like it's directly performing tasks such as web browsing or image creation, it's actually more like a director in a play, cueing various actors (specialized tools) to take the stage and perform their roles.

It's a master at converting your text prompts into action directives that these tools can understand and execute.

The real magic happens when these capabilities are woven together to create a seamless and interactive experience, much like a well-orchestrated kitchen run by our efficient and resourceful PrinterLLM.