In recent months, large language models (LLMs), or foundation models, such as OpenAI's ChatGPT have become incredibly popular. However, for those of us working in the field, it's not always clear how these models came to be, what their implications are for developing AI products, and what risks and considerations we should keep in mind. In this article, we'll explore these questions and aim to give you a better understanding of LLMs so you can start using them effectively in your own work.
A few key research developments in recent years have paved the way to advancements in Natural Language Processing (NLP), leading to today's LLMs and tools like ChatGPT. One major breakthrough was the introduction of the transformer architecture, which has become ubiquitous in NLP. Despite minimal changes to its original design, the performance of LLMs has rapidly progressed, mainly through scaling these models, unlocking new abilities such as few-shot learning. Additionally, techniques have been developed to better align these models with our objectives, such as reinforcement learning from human feedback (RLHF), used in ChatGPT.
Figure 1: Brief overview of research milestones leading to modern LLMs.
The Transformer architecture, released by Google in 2017, is the backbone of modern LLMs. It is a powerful neural network architecture, or what can be seen as a computing machine, based on self-attention. Self-attention helps the model learn to weigh different parts of its input and works well for NLP since it captures both short- and long-range dependencies between words. The other major benefit is that the architecture handles variable-length input.
The models are trained through self-supervised learning, where the aim is to predict a hidden word in a sentence. In doing so, they are implicitly forced to learn powerful representations, or an understanding, of language. They can then be used for many downstream tasks based on this accumulated knowledge.
“The cat sat on the …” Here the model must aim to predict the masked word. Through self-attention it will learn that “cat” is important for predicting the masked word.
Two common approaches are bidirectional training, where a word in the middle of a sentence is masked and must be predicted, and autoregressive training, where the next word in a sequence must be predicted; the latter is the approach used by the GPT family.
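To make these two objectives concrete, below is a minimal sketch using the Hugging Face transformers library; the model names are just common defaults and not tied to any particular LLM discussed here.

```python
# Minimal sketch of the two pre-training objectives (illustrative model choices).
from transformers import pipeline

# Bidirectional (masked) objective: predict a hidden word in the middle of a sentence.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The cat sat on the [MASK]."))  # e.g. "mat", "couch", "floor"

# Autoregressive objective (GPT family): predict the next words left to right.
generate = pipeline("text-generation", model="gpt2")
print(generate("The cat sat on the", max_new_tokens=5))
```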
The success of transformers in recent years has meant that most NLP use cases involve leveraging a pre-trained transformer model, typically found on the popular Hugging Face model hub, to solve a particular downstream task such as text classification, question answering or summarisation.
These models are usually not very performant out of the box on a specific use case, so they need to be fine-tuned with labelled data. Once a model is trained, it can be deployed and hosted on the cloud behind an API to be integrated into other applications. Note that this whole process comes with a significant cost and effort for data collection, model training and optimisation, as well as the maintenance of models through MLOps.
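As a rough illustration of this workflow, here is a minimal fine-tuning sketch using the Hugging Face Trainer API; the dataset, base model and hyperparameters are illustrative choices, not a recommendation.

```python
# Sketch: fine-tune a pre-trained transformer for text classification.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")  # illustrative labelled dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=8),
    # Small subsets to keep the sketch cheap to run.
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].select(range(500)),
)
trainer.train()
```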
The GPT-3 paper "Language Models are Few-Shot Learners" showed that few-shot learning improves as LLMs are scaled up in both parameter count and dataset size. This is important because few-shot learning means that a model does not need to be fine-tuned on use-case-specific data but can already perform well out of the box on many tasks.
This trend of improving few-shot capabilities seems to have continued with ChatGPT, as this paper benchmarks its performance on 20 NLP datasets. It is competitive with fine-tuned models, yet perhaps still not as good for certain tasks such as named entity recognition, summarisation and sentiment analysis.
This gap will likely continue to shrink, and we can expect that at some point LLMs will perform many tasks with very high accuracy without any fine-tuning. GPT-4 most likely already closes much of the gap, but there is no official, comprehensive analysis of its performance on NLP datasets.
Figure 2: Examples of zero-shot, one-shot and few-shot learning through prompting.
With in-context learning, performance is based solely on the prompt provided to the model. Prompt engineering is about providing the best prompt to perform a specific task. It is worth noting that LLMs are not explicitly trained to learn from examples given in the prompt; this is rather an emergent property of large models.
Prompts can include instructions for the model, examples of expected behaviour, or a mix of both. Research shows that decomposing a task into subtasks can be helpful. Another approach, known as chain-of-thought prompting, involves asking the model to reason through the problem step by step before giving an answer.
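To make this concrete, here is a small sketch of a few-shot prompt that also asks for reasoning before the answer, sent through the OpenAI Python client as it existed at the time of writing; the task, examples and model name are purely illustrative.

```python
# Sketch: few-shot prompt with a reasoning step before the final answer.
import openai

prompt = """Classify the sentiment of the review and explain your reasoning first.

Review: "The plot was thin but the acting saved it."
Reasoning: The reviewer criticises the plot but praises the acting overall.
Sentiment: positive

Review: "Two hours of my life I will never get back."
Reasoning: The reviewer expresses regret about watching the film.
Sentiment: negative

Review: "Beautiful cinematography, forgettable everything else."
Reasoning:"""

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic-ish output for classification
)
print(response["choices"][0]["message"]["content"])
```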
As well as optimising instructions, the examples shown within the prompt should also be carefully chosen to maximise performance. A bias can be introduced whereby the model over-predicts the last or most common answer among the examples. This paper shows that the order in which examples are provided also matters and can have a large impact on performance. Semantic similarity can be used to pick examples similar to the test example.
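As an illustration of example selection by semantic similarity, the sketch below uses the sentence-transformers library to pick the candidate examples closest to the test input; the embedding model and example texts are illustrative.

```python
# Sketch: pick few-shot examples that are semantically similar to the test input.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

candidate_examples = [
    "The battery lasts all day.",
    "Customer support never replied to my emails.",
    "Shipping was fast and the packaging was neat.",
    "The screen cracked after one week.",
]
test_input = "The delivery arrived earlier than expected."

candidate_emb = embedder.encode(candidate_examples, convert_to_tensor=True)
test_emb = embedder.encode(test_input, convert_to_tensor=True)

scores = util.cos_sim(test_emb, candidate_emb)[0]
top_indices = scores.topk(2).indices.tolist()
selected = [candidate_examples[i] for i in top_indices]
print(selected)  # the two most similar examples to include in the prompt
```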
Finally, beyond manual prompt engineering, there is research into automating the prompt generation process. According to these experiments, LLMs are able to write prompts that achieve performance comparable to human-written ones.
Alignment is a relatively new research topic concerned with creating systems that behave in accordance with the goals and values of their users. LLMs such as ChatGPT are trained to provide answers that a human would likely expect, instead of simply plausible next words. This process greatly improves conversational and instruction-following capabilities as well as reducing harmful or biased output.
For example, when asked the question "What is the capital of Brazil?", an LLM not trained with RLHF, such as GPT-3, might continue with "What is the capital of the USA?", likely imagining a follow-up question in a quiz. However, this response does not answer the question asked.
RLHF is an efficient approach to the alignment problem since it incorporates human ratings of model outputs without the need to explicitly define a reward function. The process involves training a reward model from human feedback to score the outputs of the LLM, and then using the reward model to optimise the LLM through reinforcement learning so that it produces outputs likely to align with human expectations. Note that an additional, optional step is to first fine-tune the LLM in a supervised manner on labelled demonstration data.
Figure 3: Overview of RLHF from OpenAI
As many of us have experienced through ChatGPT, LLMs are now capable of far more than classical language understanding tasks, from writing poems to writing code and providing legal or medical insights. This advanced reasoning seems to have significantly improved with GPT-4, which is able to pass many human exams through not just memorisation but also reasoning. As shown in the Microsoft paper, LLMs are showing "sparks of AGI" by exhibiting intelligence across a large collection of tasks as opposed to competence in a specific task.
The improved model performance and new emerging capabilities open new applications and possibilities for businesses and users.
Lower entry barrier
LLMs are becoming very good at few-shot learning and do not need to be fine-tuned on use-case-specific data; they can instead be used out of the box. The cost of building many NLP use cases, in particular Proofs of Concept (PoCs), is vastly reduced, as the focus can be put on prompt engineering and performance evaluation without the need for model fine-tuning, extensive data collection and labelling, model maintenance through MLOps or dedicated infrastructure.
New use cases
While LLMs are effective at many NLP use cases without the need for fine-tuning, they also offer the opportunity for novel use cases that were not technically feasible until recently. These new capabilities can be broadly divided into two dimensions: breadth and depth. Breadth refers to their general competence across a wide range of tasks, allowing them to synthesise and combine knowledge across different contexts. For example, we could imagine a travel app where the LLM has knowledge about restaurants and food, accommodation, as well as different activities and places. Depth, on the other hand, refers to their advanced reasoning and intellectual capacity, such as coding skills or advanced writing skills. We will delve into more detail around use cases further below.
Fine-tuning can still be useful
Fine-tuning LLMs can still be useful when higher accuracy is expected and more control over the model is required. While LLM performance with few-shot learning is often good, it may not match that of task-specific fine-tuned models. Also, the chances of outperforming prompt engineering with fine-tuning increase as more training data becomes available.
Distillation
Distillation is where the predictions of a larger, more powerful model are used as labels for training a smaller model. It can be a promising approach, as shown with the Alpaca model from Stanford, which is fine-tuned in a supervised manner on data collected from ChatGPT. This has the advantage of producing a smaller model over which one has full control.
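As a rough illustration, the sketch below collects labels from a large model to build a training set for a smaller one; the OpenAI client interface is the one available at the time of writing, and the prompt, label set and file name are illustrative. The resulting dataset could then be used with a fine-tuning setup like the Trainer sketch shown earlier.

```python
# Sketch of distillation: use a large model's predictions as labels for a small model.
import json
import openai

unlabelled_texts = [
    "The app crashes every time I open the camera.",
    "Love the new dark mode, very easy on the eyes.",
]

with open("distilled_dataset.jsonl", "w") as f:
    for text in unlabelled_texts:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": ("Label the following app review as 'bug', 'praise' or "
                            f"'feature request'. Reply with the label only.\n\n{text}"),
            }],
            temperature=0,
        )
        label = response["choices"][0]["message"]["content"].strip()
        # Store teacher labels alongside the raw text for later fine-tuning.
        f.write(json.dumps({"text": text, "label": label}) + "\n")
```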
Data augmentation
LLMs can also be used to augment training data, either by generating new examples based on a prompt or by transforming existing examples through rephrasing, as done with AugGPT. Here, we need to ensure that generated samples are realistic and faithful to the true input data. Moreover, the generated samples should be diverse and cover a good part of the input distribution. Since we would be training our own smaller model, we again benefit from a smaller model and full control over it.
There are certain risks and considerations to keep in mind before jumping into using LLMs, which we split into the following areas:
Hallucinations
Reproducibility
Results can vary between identical calls because of the temperature parameter, which controls the degree of randomness and creativity. A temperature of 0 should produce the same results with the same input prompt, since the operation amounts to picking the argmax over all possible tokens, yet this is not guaranteed due to the stochastic nature of parallel computing on GPUs.
Unpredictable output format
The model may not always respect the requested output format, in particular when using a temperature parameter greater than 0 to add randomness. Finally, we could imagine a more explicit prompt on the output format, which may come at the expense of model performance.
Explainability & interpretability
Training data leak
Model Bias
Inappropriate content
Latency
Copyright of generated content
Copyright and intellectual property (IP) rights over generated content are another key point to keep in mind. This year, the US Copyright Office indicated it was open to granting ownership of AI-generated content on a case-by-case basis. The idea is that one has to prove that a person was involved to some degree in the creative process and did not rely solely on the AI.
Third-party intellectual property (IP)
LLMs are trained on large amounts of content from the internet, which may include IP-protected content. As a result, there is a risk that the models may generate content that is similar to IP-protected content that was included in the training data.
Data privacy & confidentiality
When leveraging a closed API, potentially sensitive data is sent to be processed by the provider on a cloud server. Steps should be taken to understand how such data may be stored or used for training by the API provider. Special care should be taken when using personal data, in particular to respect GDPR regulations. Many companies will look to use the OpenAI APIs via Azure, which does not send your data to OpenAI and allows you to request to opt out of the logging process. Azure also offers solutions for hosting a dedicated copy of a model for more control over data access. Open-source models remain the other option, where companies have more control over data usage.
AI model licensing
It is important to review the licensing agreements and terms of use set by the provider. These agreements may impose restrictions on the use of the LLM and may require payment of fees for commercial use. Additionally, Service Level Agreements (SLAs) may not guarantee specific processing times, which can impact the effectiveness of using LLMs for certain applications.
Vendor lock-in
Building systems that rely on external APIs can create a dependency on external products in the long term. This can result in additional maintenance and development costs, as prompts may need to be rewritten and validated when a new LLM version is released.
Cost
Although APIs can be a cost-effective way to use LLMs, the cost can add up based on the number of tokens used. In some cases, it may be more cost-efficient to use fine-tuned models, where the primary cost would be for the hardware required to serve the model.
Developing and operating LLMs is a constantly evolving field with best practices still in their infancy. However, there are some overarching guiding principles that can be applied throughout the stages from PoC phase to production.
Figure 4: Overview of PoC phase of iterative improvement.
It is important to follow Agile principles and to start with a small PoC to test feasibility. This first phase should involve an iterative process of generating prompts, evaluating model performance on a diverse dataset and improving the performance through adaptation of the prompt after taking into account user feedback and error analysis.
Basic principles of prompt engineering boil down to clarifying instructions and/or adding examples, as mentioned previously. Complex tasks can be tackled by breaking them down into simpler subtasks or by asking the model to explain its thought process before producing the output. Another technique, known as self-consistency, involves generating multiple answers and selecting the most consistent one, for example by majority vote. There is a trade-off between performance on the one hand and cost and latency on the other, due to longer inputs and outputs.
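Below is a minimal sketch of the self-consistency idea: sample several answers with a temperature above 0 and keep the most frequent final answer. The OpenAI client interface reflects the version available at the time of writing, and the question, sample count and answer format are illustrative.

```python
# Sketch of self-consistency: sample several answers and keep the most frequent one.
from collections import Counter
import openai

question = ("A bat and a ball cost 1.10 in total. The bat costs 1.00 more than "
            "the ball. How much does the ball cost? Think step by step, then give "
            "the final answer on a new line starting with 'Answer:'.")

answers = []
for _ in range(5):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
        temperature=0.7,  # randomness so reasoning paths differ between samples
    )
    text = response["choices"][0]["message"]["content"]
    # Keep only the final answer line for voting.
    final = [line for line in text.splitlines() if line.startswith("Answer:")]
    if final:
        answers.append(final[-1])

print(Counter(answers).most_common(1))  # the most consistent answer
```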
Additional steps are required to build production systems with LLMs. Below, we show a simple case with one forward pass through an LLM to produce an output, but there can also be more complex systems with multiple tasks solved by LLMs.
Figure 5: Overview of a basic production system architecture.
For certain applications, outputs will need to be verified by users to guarantee correctness. User feedback can be optional or mandatory depending on the criticality of the application, or based on flagged outputs that have a high likelihood of being incorrect (such an estimate would need to be computed additionally).
It is important to implement a data collection pipeline of corrected outputs and feedback for subsequent improvements of the model. Using such an approach can enable a smoother product release while maintaining strong oversight and improvement potential. Finally, as the model improves, human involvement can be gradually reduced.
Figure 6: Overview of basic system architecture including user feedback.
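As a minimal illustration of the feedback and data collection pipeline described above, here is a small logging sketch; the file name, schema and example values are purely illustrative assumptions.

```python
# Sketch: store prompt, model output and user correction for later evaluation
# and fine-tuning.
import json
from datetime import datetime, timezone

def log_feedback(prompt: str, model_output: str, user_correction: str | None,
                 accepted: bool, path: str = "feedback_log.jsonl") -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "model_output": model_output,
        "accepted": accepted,
        "user_correction": user_correction,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example: the user rejected the model's answer and supplied a correction.
log_feedback("Summarise the ticket ...", "The customer wants a refund.",
             "The customer wants a replacement, not a refund.", accepted=False)
```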
Large Language Models (LLMs) are powerful on their own, but they become even more useful when combined with other systems, data sources or computational tools. Many of these areas are still experimental, but they show significant potential. Generally, we can group the use cases into three main categories:
When the data set is too large to fit within the LLM's prompt, LLMs can be paired with a search engine. The search engine matches user queries with the most relevant documents and provides snippets of text to the LLM as context along with the user query. The LLM can then answer questions about the documents, summarise results and more. This can be achieved through a vector database such as Pinecone, where documents are stored as vector representations and the relevant content for the user query is fetched through semantic similarity search.
Figure 7: System overview of a question answering system over documents
Some practical examples of this approach can be found in LangChain, with its Q&A over documents, or with cloud providers like Azure through Azure Cognitive Search.
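To illustrate the retrieval step end to end, here is a small sketch where an in-memory cosine-similarity search stands in for a vector database such as Pinecone; the documents, embedding model, LLM model name and OpenAI client interface (as of the time of writing) are all illustrative.

```python
# Sketch: retrieval-augmented question answering over a small set of documents.
from sentence_transformers import SentenceTransformer, util
import openai

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Premium support is available 24/7 via chat and phone.",
    "Orders above 50 EUR ship for free within the EU.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def answer(question: str, top_k: int = 2) -> str:
    # Fetch the most similar documents to use as context.
    query_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.cos_sim(query_emb, doc_embeddings)[0].topk(top_k).indices.tolist()
    context = "\n".join(documents[i] for i in hits)
    prompt = (f"Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]

print(answer("Can I return an order after three weeks?"))
```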
We can build a system to answer questions about data found in tables, which can include numerical and categorical data. In the case of multiple tables, an approach similar to the semantic similarity search in the first example can be used to pick the correct table. Then, a textual representation of the table, including column names with descriptions and possible values, can be passed as context to the LLM along with the user query, and the model can return a SQL-style query that can be executed to retrieve the result.
Figure 8: System overview of a question answering system over tabular data where the LLM returns a SQL-like query that can be executed to return the answer.
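The sketch below shows one way this could look: the table schema is described in the prompt, the LLM writes a SQL query, and the query is executed locally with sqlite3. The table, schema description, model name and OpenAI client interface (as of the time of writing) are illustrative assumptions.

```python
# Sketch: question answering over tabular data via LLM-generated SQL.
import sqlite3
import openai

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("EMEA", "widget", 1200.0),
    ("EMEA", "gadget", 800.0),
    ("APAC", "widget", 950.0),
])

schema = ("Table sales(region TEXT, product TEXT, revenue REAL): "
          "one row per region and product with total revenue.")
question = "What is the total revenue in EMEA?"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": (f"{schema}\n\nWrite a single SQLite query answering: {question}\n"
                    "Return only the SQL, no explanation."),
    }],
    temperature=0,
)
sql = response["choices"][0]["message"]["content"].strip().strip("`")
# In production, the generated query should be validated before execution.
print(conn.execute(sql).fetchall())
```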
LLMs are remarkable at many tasks yet struggle with certain functionalities such as factual lookup or arithmetic. This is why allowing LLMs to use various tools, such as performing a Google search or accessing a calculator or other product APIs, can significantly increase their capabilities. This can be achieved by providing the LLM with a description of the tool and asking it to plan the actions necessary to answer a user's question. More complex methods do exist, yet simple ones are already effective and allow the LLM's capabilities to be extended with third-party tools.
Figure 9: Overview of augmenting LLMs by allowing them to use additional tools through API calls.
A practical example of this approach can be seen with ChatGPT Plugins from OpenAI, which augment ChatGPT with tools like a browser, a code interpreter and third-party services, or with LangChain, where simple application examples are provided, from querying a weather service to a movie database.
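To give a feel for the basic loop, here is a deliberately simplified tool-use sketch: the LLM is shown a tool description, asked to choose a tool call, and the result is fed back for the final answer. The tool, prompt format, model name and OpenAI client interface (as of the time of writing) are illustrative; real frameworks such as LangChain handle this far more robustly.

```python
# Sketch: a minimal tool-use loop with a single calculator tool.
import openai

def calculator(expression: str) -> str:
    # Toy example only; never eval untrusted input in a real system.
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

question = "What is 17.5% of 2840?"
tool_prompt = (
    "You can use the tool 'calculator: evaluates a Python arithmetic expression'.\n"
    f"Question: {question}\n"
    "Reply exactly in the form: TOOL <tool name> <input>"
)

choice = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": tool_prompt}],
    temperature=0,
)["choices"][0]["message"]["content"]

_, tool_name, tool_input = choice.split(maxsplit=2)
observation = TOOLS[tool_name](tool_input)

final = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": (f"Question: {question}\n"
                                           f"Tool result: {observation}\n"
                                           "Give the final answer.")}],
    temperature=0,
)["choices"][0]["message"]["content"]
print(final)
```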
In this approach, LLMs are treated as intelligent agents, capable of independently achieving a provided objective such as doing your web research for you, or automatically replying to emails. By giving them a goal, the LLM then plans a series of actions required to solve the goal, which it will then perform sequentially. After each subtask, it can assess the outcome of the subtask and adapt the plan if necessary or continue until the task is resolved.
Figure 10: System overview of an autonomous agent that has access to external tools as well as a task backlog that it populates itself and maintains.
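A highly simplified sketch of such an agent loop is shown below: the LLM maintains its own task backlog, completes one task at a time and replans after each result. The objective, prompts and hard iteration cap are illustrative assumptions; real systems such as AutoGPT add memory, tools and safety checks on top of this pattern.

```python
# Sketch: a toy autonomous-agent loop with a self-maintained task backlog.
import openai

def ask(prompt: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]

objective = "Draft a short market overview of e-bike subscription services."
backlog = ask(f"Objective: {objective}\nList 3 subtasks, one per line.").splitlines()
notes = []

while backlog and len(notes) < 5:  # hard cap to avoid endless replanning
    task = backlog.pop(0)
    result = ask(f"Objective: {objective}\nCurrent task: {task}\nComplete the task.")
    notes.append(result)
    # Let the model revise the remaining backlog in light of the result.
    revised = ask(f"Objective: {objective}\nDone: {task}\nRemaining tasks:\n"
                  + "\n".join(backlog)
                  + "\nReturn the updated remaining tasks, one per line, or NOTHING.")
    backlog = [] if revised.strip() == "NOTHING" else revised.splitlines()

print("\n\n".join(notes))
```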
This is a very experimental direction that is evolving quickly. Recent developments have added pieces such as giving the agent the ability to store memories. The most popular example is the open-source AutoGPT, based on GPT-4. There is also HuggingGPT, which uses an LLM to autonomously pick which Hugging Face model to use, covering text, images and sound. Finally, such agents can be used to create realistic NPCs in virtual environments, for gaming in particular.
LLMs have ushered in a new era of AI where the entry barrier for many applications has significantly decreased thanks to their strong capabilities across a broad range of tasks. There is often no longer a need to train and maintain custom models, as the emergent properties of LLMs enable in-context learning and high performance through prompt engineering. We have explored several technically feasible applications, and we encourage companies to begin implementing these through initial PoC testing. Finally, we have discussed how LLMs can be augmented with other tools and what the future with autonomous agents might look like.
This work is based on information relevant as of May 10th.