Large language models (LLMs) are advanced AI-based models trained to process and generate human language in a way that closely mirrors natural human communication. In a nutshell, LLMs are designed to understand and generate text like a human, in addition to other forms of content, based on the vast amount of data used to train them. They have the ability to infer from context, generate coherent and contextually relevant responses, translate to languages other than English, summarize text, answer questions (general conversation and FAQs) and even assist in creative writing or code generation tasks.

generative AI

Generative AI is a family of machine learning techniques that create new data samples from trained models. These models are also called generative models. In other words, generative models learn the underlying patterns and structures of a given dataset and can generate new samples that resemble the original data.

types of generative AI models

- Generative Adversarial Networks (GANs):
These involve two neural networks, a generator and a discriminator, which work against each other to improve the quality of generated data, often used in realistic image generation.

- Variational Autoencoders (VAEs):
VAEs are used for generating new data points by learning a compressed representation of the input data, commonly applied in image processing and generation.

- Encoder-Decoder Transformer Architecture (e.g., T5, BART):
These models are designed for tasks like text translation, summarization, and question-answering, where both input and output are sequences of data.

- Encoder-Only Transformer Architecture (e.g., BERT):
Primarily used for understanding and processing input data, such as in language understanding tasks (LLMs), but not typically used for generative purposes like the other models mentioned.

- Autoregressive (Decoder-Only Transformer such as GPT):
These models predict the next item in a sequence, making them powerful for tasks like text generation (LLMs).

- Flow-Based Models:
These models, such as Normalizing Flows, are designed to explicitly model the distribution of data, allowing for both efficient generation and density estimation.

- Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) Networks:
Used in sequence generation tasks like text and music before the rise of transformer models.

- Hybrid Models:
Some newer architectures combine elements of different models (like GANs and VAEs) to leverage the strengths of each in generating complex data.

LLMs are based on the Transformer Architecture, which allows them to capture complex language patterns and relationships between words or phrases in large-scale text datasets.

what is a language model

A language model is a machine learning model that aims to predict and generate plausible language. Autocomplete is a language model, for example.

These models work by estimating the probability of a token or sequence of tokens occurring within a longer sequence of tokens. Consider the following sentence:

When I hear rain on my roof, I _______ in my kitchen.

If you assume that a token is a word, then a language model determines the probabilities of different words or sequences of words to replace that underscore. For example, a language model might determine the following probabilities:

cook soup 9.4%
warm up a kettle 5.2%
cower 3.6%
nap 2.5%
relax 2.2%
...

A “sequence of tokens” could be an entire sentence or a series of sentences. That is, a language model could calculate the likelihood of different entire sentences or blocks of text.
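The idea of estimating next-word probabilities can be sketched with a very small bigram model, which only looks at the single preceding word. This is a minimal illustration, not how an LLM is built: the corpus and the function names `train_bigram_model` and `next_word_probabilities` are hypothetical and invented for this example.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probabilities(counts, prev_word):
    """Turn raw follow-up counts into a probability distribution."""
    followers = counts[prev_word]
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}

# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "i cook soup in my kitchen",
    "i cook soup for dinner",
    "i cook rice in my kitchen",
    "i nap in my bed",
]

model = train_bigram_model(corpus)
probs = next_word_probabilities(model, "cook")
print(probs)  # "soup" follows "cook" twice, "rice" once
```

A real language model conditions on far more context than one word, but the output has the same shape: a set of candidate continuations, each with a probability.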

Estimating the probability of what comes next in a sequence is useful for all kinds of things: generating text, translating languages, and answering questions, to name a few.

what is a large language model

Modeling human language at scale is a highly complex and resource-intensive endeavor. The path to reaching the current capabilities of language models and large language models has spanned several decades.

As models are built bigger and bigger, their complexity and efficacy increase. Early language models could predict the probability of a single word; modern large language models can predict the probability of sentences, paragraphs, or even entire documents.

The size and capability of language models have exploded over the last few years as computer memory, dataset size, and processing power have increased, and as more effective techniques for modeling longer text sequences have been developed.

The word “large” usually describes the number of parameters in the language model. Large language models (LLMs) are called “large” because they have a large number of parameters (100M+) and are pre-trained on large corpora of text to process, understand, and generate natural language for a wide variety of NLP tasks.

usage of the LLM

Text generation
Generates language, such as emails, blog posts or other mid-to-long form content, in response to prompts; the output can then be refined and polished.

Content summarization
Summarizes long articles, news stories, research reports, corporate documentation and even customer history into thorough texts tailored in length to the output format.

AI assistants
Powers chatbots that answer customer queries, perform backend tasks and provide detailed information in natural language as part of an integrated, self-serve customer care solution.

Code generation
Assists developers in building applications, finding errors in code and uncovering security issues in multiple programming languages, even “translating” between them.

Sentiment analysis
Analyzes text to determine the customer’s tone in order to understand customer feedback at scale and aid in brand reputation management.

Language translation
Provides wider coverage to organizations across languages and geographies with fluent translations and multilingual capabilities.

how LLMs work

LLMs operate by leveraging deep learning techniques and vast amounts of textual data. These models are typically based on a transformer architecture, such as the generative pre-trained transformer (GPT), which excels at handling sequential data like text input. LLMs consist of multiple layers of neural networks, each with parameters that are adjusted during training. These layers are further enhanced by the attention mechanism, which focuses on specific parts of the input.

During the training process, these models learn to predict the next word in a sentence based on the context provided by the preceding words. The model does this by assigning a probability score to each candidate next token. The text is first tokenized, that is, broken down into smaller sequences of characters. These tokens are then transformed into embeddings, which are numeric representations that capture this context.

To ensure accuracy, this process involves training the LLM on massive corpora of text (in the billions of pages), allowing it to learn grammar, semantics and conceptual relationships through self-supervised learning. Once trained on this data, LLMs can generate text by autonomously predicting the next word based on the input they receive, drawing on the patterns and knowledge they’ve acquired. The result is coherent and contextually relevant language generation that can be harnessed for a wide range of NLU and content generation tasks.
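The generation step described above, predicting one token, appending it, and repeating, can be sketched as a short loop. The sketch below makes strong simplifying assumptions: `NEXT_TOKEN_PROBS` is a hypothetical hand-written table standing in for a trained model, it conditions only on the last token (a real LLM conditions on the whole preceding sequence), and it always picks the most likely token (greedy decoding).

```python
# Hypothetical next-token table standing in for a trained model.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"rain": 0.7, "roof": 0.3},
    "a": {"roof": 1.0},
    "rain": {"falls": 0.8, "<end>": 0.2},
    "roof": {"<end>": 1.0},
    "falls": {"<end>": 1.0},
}

def generate(max_tokens=10):
    """Greedy autoregressive decoding: pick the most likely next token each step."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        candidates = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = max(candidates, key=candidates.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the <start> marker

generated = generate()
print(" ".join(generated))
```

Each new token is fed back in as context for the next prediction, which is what "autoregressive" means here.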

Model performance can also be improved through prompt engineering, prompt-tuning, fine-tuning and other tactics like reinforcement learning from human feedback (RLHF) to reduce the biases, hateful speech and factually incorrect answers known as “hallucinations” that are often unwanted byproducts of training on so much unstructured data. This is one of the most important aspects of ensuring enterprise-grade LLMs are ready for use and do not expose organizations to unwanted liability, or cause damage to their reputation.

concepts in the LLM

Model architecture

Model size

transformer

attention mechanism

Attention mechanisms in LLMs, particularly the self-attention mechanism used in transformers, allow the model to weigh the importance of different words or phrases in a given context. By assigning different weights to the tokens in the input sequence, the model can focus on the most relevant information while ignoring less important details. This ability to selectively focus on specific parts of the input is crucial for capturing long-range dependencies and understanding the nuances of natural language.
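The weighting described above can be sketched as scaled dot-product attention, the formulation used in transformers. This is a minimal pure-Python sketch under stated assumptions: the token vectors are hypothetical, and the same vectors are reused as queries, keys, and values, whereas a real transformer first passes them through learned projection matrices.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, turned into attention weights.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Each output is a weighted average of the value vectors.
        output = [sum(w * v[i] for w, v in zip(weights, values))
                  for i in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors for three tokens.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(vectors, vectors, vectors)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how much one token attends to every token in the sequence; tokens whose vectors point in similar directions receive larger weights.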

Self-attention

multihead attention

parameters

LLMs have millions or even billions of parameters, each influencing how the model comprehends language. These parameters can include:

  • Weights: These determine the importance of specific connections between words and phrases, allowing the model to learn patterns and relationships. Weights are numerical values that define the strength of connections between neurons across different layers in the model. In the context of LLMs, weights are primarily used in the attention mechanism and the feedforward neural networks that make up the model’s architecture. They are adjusted during the training process to optimize the model’s ability to generate relevant and coherent text.

  • Biases: These act as starting points, guiding the model’s interpretations before it sees data. Biases are additional numerical values that are added to the weighted sum of inputs before being passed through an activation function. They help to control the output of neurons and provide flexibility in the model’s learning process. Biases can be thought of as a way to shift the activation function to the left or right, allowing the model to learn more complex patterns and relationships in the input data.

  • Embedding vectors: These represent words numerically, enabling the model to understand their meaning and context.
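To make the roles of weights and biases concrete, here is a minimal sketch of a single artificial neuron, together with a parameter count for one fully connected layer. The function names, the sigmoid activation, and the layer sizes are assumptions chosen for illustration; they are not taken from any particular LLM.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus a bias, passed through a sigmoid activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer_parameter_count(n_inputs, n_outputs):
    """One weight per input-output connection, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

inputs = [0.5, -1.0]
weights = [2.0, 1.0]  # weighted sum is 0.5*2.0 + (-1.0)*1.0 = 0.0

print(neuron(inputs, weights, bias=0.0))  # sigmoid(0) = 0.5
print(neuron(inputs, weights, bias=2.0))  # the bias shifts the output upward
print(dense_layer_parameter_count(512, 2048))
```

Changing the bias moves the neuron's output even when the inputs and weights stay the same, which is the "shift" described above. The parameter count shows how quickly the total grows: even a single modest layer already holds about a million values.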

Tokenization

Tokenization is the process of converting a sequence of text into individual words, subwords, or tokens that the model can understand. In LLMs, tokenization is usually performed using subword algorithms like Byte Pair Encoding (BPE) or WordPiece, which split the text into smaller units that capture both frequent and rare words. This approach helps to limit the model’s vocabulary size while maintaining its ability to represent any text sequence.
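The core idea behind Byte Pair Encoding can be sketched in a few lines: start from individual characters and repeatedly merge the most frequent adjacent pair into a new symbol. The sketch below is a simplified illustration with a hypothetical four-word corpus; production tokenizers also track word frequencies, store the learned merge rules, and handle bytes and special tokens.

```python
from collections import Counter

def most_frequent_pair(words):
    """Find the adjacent symbol pair that occurs most often across all words."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged_words = []
    for symbols in words:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        merged_words.append(merged)
    return merged_words

# Hypothetical corpus; each word starts as a list of characters.
words = [list("low"), list("lower"), list("lowest"), list("slow")]
for _ in range(2):  # perform two merge steps
    words = merge_pair(words, most_frequent_pair(words))
print(words)
```

After two merges the frequent fragment "low" has become a single token, while rarer endings such as "e", "r", "s", "t" remain separate. This is how subword vocabularies stay small yet can still represent any word.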

pre-training

Pre-training is the process of training an LLM on a large dataset, usually in an unsupervised or self-supervised way, before fine-tuning it for a specific task. During pre-training, the model learns general language patterns, relationships between words, and other foundational knowledge. This process results in a pre-trained model that can be fine-tuned using a smaller, task-specific dataset, significantly reducing the amount of labeled data and training time required to achieve high performance on various NLP tasks.
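"Self-supervised" means the training labels come from the text itself, so no human annotation is needed. A minimal sketch of how next-word training examples could be built from raw text follows; the function name `next_token_examples`, the whitespace tokenization, and the `context_size` parameter are assumptions made for illustration.

```python
def next_token_examples(text, context_size=3):
    """Build (context, target) pairs where the target is simply the next word."""
    tokens = text.split()
    examples = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        examples.append((context, tokens[i]))
    return examples

examples = next_token_examples("when i hear rain on my roof")
for context, target in examples:
    print(context, "->", target)
```

Every position in the text yields one training example, which is why large unlabeled corpora are enough to pre-train a model.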

prompt

prompt engineering

Temperature

Temperature is a hyperparameter that regulates the randomness, sometimes described as the creativity, of an LLM’s responses. At each step the model has probabilities for all the different words that could follow and selects the next word to output from them. Temperature reshapes that distribution before the selection is made. A temperature of 0 restricts the model to the word with the highest probability, which makes the output effectively deterministic: you can run the same prompt over and over and get the same output. As you increase the temperature, the restriction softens, and words with lower and lower probabilities become more likely to be chosen.
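A common way to implement temperature is to divide the model's raw scores (logits) by the temperature before applying softmax. The sketch below assumes that formulation; the words, the logit values, and the function names are hypothetical, and temperature 0 is handled as a special case that always returns the top-scoring word.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Rescale logits by 1/temperature, then normalize with softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_word(words, logits, temperature, rng=random):
    """Temperature 0 is treated as greedy decoding (always the top word)."""
    if temperature == 0:
        return words[logits.index(max(logits))]
    probs = apply_temperature(logits, temperature)
    return rng.choices(words, weights=probs, k=1)[0]

words = ["cook", "nap", "relax"]
logits = [2.0, 1.0, 0.5]  # hypothetical raw model scores

cold = apply_temperature(logits, 0.5)  # sharper: top word dominates
hot = apply_temperature(logits, 2.0)   # flatter: other words gain probability
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
print(sample_next_word(words, logits, temperature=0))
```

Lower temperatures concentrate probability on the top word, while higher temperatures spread it across the alternatives, so sampled outputs become more varied.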

prompt-tuning

fine-tuning

Transfer learning

Transfer learning is the technique of leveraging the knowledge gained during pre-training and applying it to a new, related task. In the context of LLMs, transfer learning involves fine-tuning a pre-trained model on a smaller, task-specific dataset to achieve high performance on that task. Its main advantage is that the model can draw on the vast amount of general language knowledge learned during pre-training, reducing the need for large labeled datasets and extensive training for each new task.
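One simple form of transfer learning keeps the pre-trained representation frozen and trains only a small task-specific "head" on top of it. The sketch below illustrates that variant, which is narrower than fine-tuning the whole model. Everything in it is hypothetical: the two-dimensional `PRETRAINED_VECTORS` stand in for knowledge learned during pre-training, and the task is a toy sentiment classifier trained on just two labeled words.

```python
import math

# Hypothetical frozen "pre-trained" word vectors; fine-tuning never changes them.
PRETRAINED_VECTORS = {
    "great": [1.0, 0.2],
    "good": [0.8, 0.1],
    "bad": [-0.9, 0.3],
    "awful": [-1.0, 0.1],
}

def predict(word, head_weights, head_bias):
    """Score a word with the small task-specific head on top of frozen features."""
    features = PRETRAINED_VECTORS[word]
    z = sum(w * x for w, x in zip(head_weights, features)) + head_bias
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune_head(labeled_words, epochs=200, learning_rate=0.5):
    """Train only the head with gradient descent on the logistic loss."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for word, label in labeled_words:
            error = predict(word, weights, bias) - label
            features = PRETRAINED_VECTORS[word]
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

# Tiny labeled dataset for a sentiment task: 1 = positive, 0 = negative.
head_weights, head_bias = fine_tune_head([("great", 1), ("bad", 0)])
print(predict("good", head_weights, head_bias))   # word not in the labeled set
print(predict("awful", head_weights, head_bias))  # word not in the labeled set
```

Because the frozen vectors already place similar words close together, the head trained on two examples also scores "good" as positive and "awful" as negative. That is the practical payoff of transfer learning: little labeled data is needed for the new task.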

reinforcement learning with human feedback (RLHF)

performance of the LLM

White Papers for Learning LLMs

Neural Machine Translation by Jointly Learning to Align and Translate (2014)
by Bahdanau, Cho, and Bengio,
https://arxiv.org/abs/1409.0473

Attention Is All You Need (2017)
by Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin,
https://arxiv.org/abs/1706.03762

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
by Devlin, Chang, Lee, and Toutanova,
https://arxiv.org/abs/1810.04805

Improving Language Understanding by Generative Pre-Training (2018)
by Radford, Narasimhan, Salimans, and Sutskever,
https://www.semanticscholar.org/paper/Improving-Language-Understanding-by-Generative-Radford-Narasimhan/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (2019)
by Lewis, Liu, Goyal, Ghazvininejad, Mohamed, Levy, Stoyanov, and Zettlemoyer,
https://arxiv.org/abs/1910.13461

Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond (2023)
by Yang, Jin, Tang, Han, Feng, Jiang, Yin, and Hu,
https://arxiv.org/abs/2304.13712

reference

https://vitalflux.com/generative-modeling-in-machine-learning-examples/
https://www.superannotate.com/blog/llm-overview
https://vitalflux.com/large-language-models-concepts-examples/
https://www.ibm.com/topics/large-language-models
https://developers.google.com/machine-learning/resources/intro-llms
https://txt.cohere.com/llm-parameters-best-outputs-language-ai/
https://www.thecloudgirl.dev/blog/llm-parameters-explained
https://deepchecks.com/glossary/llm-parameters/
https://datascience.stackexchange.com/questions/120764/how-does-an-llm-parameter-relate-to-a-weight-in-a-neural-network