5.5.3 A Pragmatic View of AI Chatbots: Part 3

The text prompt used to create the precursor image was itself generated by the output of a chatbot.
The multiple ‘attention heads’ of transformer models are crucial to modern chatbot function.

A Few Words About How Chatbots Work

Publicly available commercial chatbots run on gigantic computer systems that are in some ways similar to the graphics cards used in the high-end domestic machines on which gamers play. These systems are designed to carry out an absolutely enormous number of simple calculations simultaneously (in parallel) at very high speed. They use a form of computing known as an artificial neural network.

Input Tokens
Chatbot applications and the underlying transformer models are merely mathematical stores of word associations operating with a limited vocabulary of numerical input tokens. These tokens (or numbers) represent words or parts of longer words. The token numbers are in themselves entirely arbitrary and so do not directly encode any useful information within the model; they serve only for the conversion of input text and the generation of output. Current estimates suggest that a model's vocabulary is in the region of 100,000 to 200,000 tokens.

Longer words are not necessarily broken down by the tokenizer into parts that are meaningful to an appropriately educated person. The number tokens that form parts of larger words are merely numerically efficient clusters of letters that occur frequently in the language training set. The word ‘immunohistochemistry’ might be broken down by a human pathologist into ‘immuno’, ‘histo’ (histology) and ‘chemistry’, but that need not be the case for a chatbot, because it depends on whether or not these fragments correspond to frequent letter combinations in the training set. “Text tokenization is deterministic. The sentence ‘The cat sat on the mat’ will always tokenize the same way for a given model. This determinism allows the model to focus entirely on the relationships between tokens.” (Source: Google Gemini). Since chatbots use tokens (words and statistically relevant parts of words) and we use words or compound words, it would appear that human text use differs very significantly from that of AI applications, even at a conceptually abstract level.
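The ideas above can be sketched in a few lines of code. The vocabulary and its ID numbers below are invented for illustration (real tokenizers, trained on enormous text sets, hold 100,000–200,000 entries), but the sketch shows both points: tokenization is deterministic, and long words are split into statistically frequent fragments rather than meaningful ones.

```python
# A toy, deterministic subword tokenizer. The vocabulary entries and
# ID numbers are invented for this sketch; the IDs carry no meaning
# in themselves, just as in a real model.
VOCAB = {
    "the": 101, "cat": 102, "sat": 103, "on": 104, "mat": 105,
    "immuno": 2001, "hist": 2002, "ochemistry": 2003,
}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization: the same input always
    produces exactly the same list of token numbers."""
    ids = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in VOCAB:
                    ids.append(VOCAB[piece])
                    start = end
                    break
            else:
                start += 1  # skip a character with no vocabulary entry
    return ids

print(tokenize("The cat sat on the mat"))
# -> [101, 102, 103, 104, 101, 105], every single time
print(tokenize("immunohistochemistry"))
# -> [2001, 2002, 2003]: 'immuno' + 'hist' + 'ochemistry' -- fragments
# frequent in the (toy) vocabulary, not the parts a pathologist would choose
```

Note that the split ‘immuno / hist / ochemistry’ is perfectly serviceable to the machine even though ‘ochemistry’ means nothing to a human reader.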

Transformer Models at the Core of General Purpose Chatbots
Chatbots are based on the transformer model of artificial neural networks, which was described in a paper published by Google employees in 2017 entitled ‘Attention Is All You Need’. What these models do exceptionally well is capture the way that we use words by encoding patterns of word associations and word orders found in human-generated text. The transformer models at the heart of Large Language Model-based chatbots use the input tokens to refer to extremely long strings of numbers (usually called embedding vectors). The embedding vector of the word ‘cat’, for example, in the trained model, is an enormous string of numbers that makes no sense to a human. To visualize such a vector, click on the words “Show the raw vector of «cat» in model MOD_enwiki_upos_skipgram_300_2_2021:” at this link. You can now see why I said earlier that there is nothing to read in these models!

The strings of numbers (or embedding vectors) can be thought of as representing word associations in a high-dimensional geometric space. This is hard for humans to envisage, as we normally can only easily think in three dimensions. These long number strings are often said to be representative of ‘meaning’ or some kind of ‘semantic relationship’. As a pragmatist, I find this unacceptably anthropomorphic. The transformer model is capturing the way in which words are strung together in natural languages. When an astronomically large number of word association examples are analysed during ‘training’ and then used at the later stage of system output, they are, from a human perspective, encoding information that was initially created by humans in text form. We should not attempt to trivialise the tremendous achievement, and the almost indescribable number of calculations, that have gone into creating systems capable of inducing the informational content of human texts. There is neither magic nor meaning here; however, the facility that we now have to use text to generate novel text streams by machine is a very considerable computational achievement.
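The geometric idea can be made concrete with a toy example. The four-dimensional vectors below are invented for this sketch (real embeddings run to hundreds or thousands of dimensions); the individual numbers mean nothing, but the angle between two vectors reflects how similarly the words are used in text:

```python
import math

# Toy embedding vectors, invented for illustration. Only the geometry
# between vectors matters, not any single number within them.
embeddings = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Closeness of direction between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Words used in similar contexts end up pointing in similar directions:
print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # close to 1
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # much smaller
```

No ‘meaning’ is stored anywhere here; the similarity score is simply a by-product of word-association statistics, which is the pragmatic point being made above.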

These models carry out an enormous number of calculations known as matrix multiplications (if you wish to learn about simple matrix multiplication, see this very basic explanatory video).
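For readers who prefer code to video, here is a single matrix multiplication written out in plain Python; this is the simple operation that transformer hardware repeats, in parallel, an astronomical number of times:

```python
# Multiply two small matrices by hand. Each entry of the result is a
# row of A combined with a column of B (multiply pairwise, then add).
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

C = [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]

print(C)  # [[19, 22], [43, 50]]
```

For example, the top-left entry is 1×5 + 2×7 = 19. The graphics-card-like hardware described earlier exists precisely to do millions of these little sums at once.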

Optimising the Output from In-Depth Inquiries

For a modest subscription fee, we have now entered the era of publicly available, though very much slower, Retrieval-Augmented Generation (RAG). RAG might involve the addition of relevant documents on your part, or an automated and extensive internet search forming part of preliminary ‘thinking’. At a technical level, a well-designed RAG system changes the probability of which text strings will be generated in the output, but it does not abolish errors. When the Google Gemini bot user options are set to ‘thinking’ and ‘Deep Research’, the preliminary text output gives the impression that the bot is using a more sophisticated technique called Chain-of-RAG. [See this example created for me, and compare it with the final output.] In this type of bot architecture, the initial query can be split into sub-queries that are used to retrieve relevant source documents. The response to an initial document retrieval appears to iteratively influence the subsequent steps of the generation process. With present-day general purpose transformer models, Chain-of-RAG should probably now be used for all professional purposes, unless one is operating in a narrow domain that has had all the necessary pre- and post-training. RAG is essentially a productive, although not infallible, ‘workaround’ for the tendency of current chatbots to make both blatant and subtle errors.
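The essence of the retrieval step can be sketched very simply. The document store, the crude word-overlap scoring and the prompt template below are all invented for illustration (a real system uses embedding-vector search and then hands the assembled prompt to a language model), but the principle is the same: retrieved text is placed in front of the question, shifting the probabilities of what the model will generate.

```python
# A minimal sketch of the retrieval step in RAG. Everything here is a
# stand-in: real systems rank documents by embedding similarity, not
# word overlap, and pass the final prompt to a language model.
documents = [
    "Haematoxylin stains cell nuclei blue.",
    "Eosin stains cytoplasm pink.",
    "The 2017 transformer paper was titled 'Attention Is All You Need'.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by crude word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def build_prompt(query: str) -> str:
    """Place retrieved sources before the question. This steers the
    output probabilities; it does not abolish errors."""
    context = "\n".join(retrieve(query, documents))
    return f"Using only the sources below, answer the question.\n{context}\nQuestion: {query}"

print(build_prompt("What does eosin stain?"))
```

A Chain-of-RAG system, as described above, would additionally split the query into sub-queries and let each round of retrieval influence the next, rather than retrieving once.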

There are also more sophisticated systems that are said to be agentic. These systems act as AI agents and are more flexible than Chain-of-RAG, as they can run inquiries autonomously and call tools through an Application Programming Interface (API). When an AI agent is incapable of determining a result, it can pause natural language processing and instead generate strictly formatted code that allows external communication through an interface to the tool. In this way, the system as a whole can generate and execute computer code and retrieve the results produced by the external tool. When the results of the function call are received, they are incorporated into the natural language output of the chatbot. The ability to solve unstructured problems, and potentially to self-correct, is an additional advantage of these automated systems. Of course, when a Chain-of-RAG operation within Google Gemini is executed using Google Search, the chatbot is also making iterative external function calls to retrieve external text sources.
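The pause-and-call mechanism can be sketched as follows. The tool name, its arguments and the fixed clock reading are all invented for illustration; the point is that the model emits strictly formatted code rather than prose, an external program executes it, and the result is woven back into the reply.

```python
import json

# A minimal sketch of tool calling by an AI agent. The tool, its name
# and its output are stand-ins invented for this illustration.
def get_current_time(city: str) -> str:
    """Stand-in for an external tool whose answer the model
    cannot produce from its trained weights alone."""
    return f"The local time in {city} is 14:32."

TOOLS = {"get_current_time": get_current_time}

# The model pauses natural language generation and emits structured
# code instead (here, a JSON-formatted function call):
model_output = '{"tool": "get_current_time", "arguments": {"city": "London"}}'

# The surrounding system parses the call and executes the real tool:
call = json.loads(model_output)
result = TOOLS[call["tool"]](**call["arguments"])

# The tool's result is then incorporated into the natural language reply:
print(f"According to the clock service: {result}")
```

The strict JSON format is what makes the hand-off reliable: unlike free prose, it can be parsed and executed mechanically, and the result returned through the same interface.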

Easy Further Reading and Resources

1. Large language models, explained with a minimum of maths and jargon, by Timothy B. Lee and Sean Trott at https://www.understandingai.org/p/large-language-models-explained-with

2. An intuitive overview of the transformer architecture by Roberto Infante
https://medium.com/@roberto.g.infante/an-intuitive-overview-of-the-transformer-architecture-6a88ccc88171

3. ‘This is not the AI we were promised’ A Royal Society Lecture by Professor Michael John Wooldridge
https://www.youtube.com/live/CyyL0yDhr7I?si=Xgi7upJpt3A0Y39G&t=560

More Advanced Resources 

4. There is a really excellent course of graphical videos about Neural Networks on Grant Sanderson’s 3Blue1Brown YouTube channel

5. Try typing in your own original prompt, and try changing the attention heads and other model characteristics, in this very impressive graphic simulation of an LLM at https://poloclub.github.io/transformer-explainer/

A graphically illustrated video from the Neural Networks course by Grant Sanderson of the 3Blue1Brown YouTube channel
Another graphically illustrated video from the Neural Networks course by Grant Sanderson of the 3Blue1Brown YouTube channel

RAG vs Fine-Tuning vs Prompt Engineering: Optimizing AI Models

A helpful video about agentic AI, if you ignore the anthropomorphic language

 < Previous Part | Contents Index | Next Part >