Build a LLM (1/2)
It's a journey into the heart of modern AI, where you'll learn about data preprocessing, model architecture, and the fascinating process of training a model to understand and generate human-like text.

This post is based mainly on my own experience while following the steps in the book, "Build a Large Language Model (from scratch)" by Sebastian Raschka.
Large Language Model in short
There are already too many definitions for this term, so in short: we can think of an LLM as software that is able to respond with human-like text. LLMs use the Transformer architecture, which allows them to predict the output based on attention to different parts of the input.
Stages of building and using LLMs

The two most popular categories of fine-tuning for LLMs are listed below (with a small dataset sketch after the list):
- Instruction fine-tuning: the labeled dataset consists of instruction and answer pairs.
- Classification fine-tuning: the labeled dataset consists of texts and associated class labels (e.g., "spam" and "not spam" labels).
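For illustration only (the field names and example records below are made up, not a prescribed dataset format), the two kinds of labeled data could look roughly like this:

# Instruction fine-tuning: instruction/answer pairs (hypothetical examples)
instruction_data = [
    {"instruction": "Rewrite this sentence in passive voice: 'The cat chased the mouse.'",
     "answer": "The mouse was chased by the cat."},
]

# Classification fine-tuning: text plus an associated class label (hypothetical examples)
classification_data = [
    {"text": "You won a free iPhone, click here!", "label": "spam"},
    {"text": "Meeting moved to 3 pm tomorrow.",    "label": "not spam"},
]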
Transformer architecture

LLM training process

Data preparation
A machine is not a human, so it needs human text converted into vectors before it can process it. To do that, we first split the text into tokens:

Then we convert those tokens into vectors, something like [0.3374, -0.1778, -0.1690] (this is a 3-dimensional vector; in reality it could have thousands of dimensions, e.g., GPT-3 used 12,288 dimensions).
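As a quick, minimal sketch of the tokenization step above (the sentence is just an example), using the tiktoken GPT-2 tokenizer that the runnable code below also relies on:

import tiktoken

# GPT-2 byte pair encoding (BPE) tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

text = "Your journey starts with one step"
token_ids = tokenizer.encode(text)
print(token_ids)                                   # a list of integer token IDs
print([tokenizer.decode([i]) for i in token_ids])  # the corresponding token strings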
However, if we only convert text into tokens, the LLM will only understand the word itself and cannot differentiate the position of the token, which is what creates context with the sibling words (of course, only we humans understand context naturally). So we enrich those vectors with position vectors: absolute or relative positional embeddings.
When we say the positions are "embedded", we mean an element-wise "add" operation on the vectors.

Above is an illustration of how the same token (vector [1, 1, 1]) can be combined with position vectors to become different vectors (top row), which now carry positional "context".
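A minimal sketch of that illustration (the [1, 1, 1] token vector and the position vectors are made-up values, not taken from a real model):

import torch

token_vec = torch.tensor([1.0, 1.0, 1.0])        # the same token appearing at 3 positions
pos_vecs = torch.tensor([[0.1, 0.0, 0.0],        # position 0
                         [0.0, 0.2, 0.0],        # position 1
                         [0.0, 0.0, 0.3]])       # position 2

input_embeddings = token_vec + pos_vecs          # the "add" operation (broadcast over positions)
print(input_embeddings)                          # three different vectors for the same token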
Runnable code
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader
import os
import urllib.request


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


# This function uses the tiktoken byte pair encoding (BPE) tokenizer
def create_dataloader_v1(txt, batch_size, max_length, stride,
                         shuffle=True, drop_last=True, num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

    return dataloader


if __name__ == "__main__":
    if not os.path.exists("the-verdict.txt"):
        url = ("https://raw.githubusercontent.com/rasbt/"
               "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
               "the-verdict.txt")
        file_path = "the-verdict.txt"
        urllib.request.urlretrieve(url, file_path)

    with open("the-verdict.txt", "r", encoding="utf-8") as f:
        raw_text = f.read()

    vocab_size = 50257
    output_dim = 256
    context_length = 1024

    token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
    pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

    batch_size = 8
    max_length = 4
    dataloader = create_dataloader_v1(
        raw_text,
        batch_size=batch_size,
        max_length=max_length,
        stride=max_length
    )

    data_iter = iter(dataloader)
    inputs, targets = next(data_iter)
    print("Inputs:\n", inputs)
    print("\nTargets:\n", targets)

    token_embeddings = token_embedding_layer(inputs)
    print(token_embeddings.shape)

    pos_embeddings = pos_embedding_layer(torch.arange(max_length))
    print(pos_embeddings.shape)

    input_embeddings = token_embeddings + pos_embeddings
    print(input_embeddings.shape)
dataset.py
Attention mechanisms

Self-attention is a mechanism that allows each position in the input sequence to consider the relevancy of, or "attend to," all other positions in the same sequence. In short, the LLM needs to know the importance of each token in the sequence to "understand" the context.
Simplified self-attention
This is not part of the Transformer block itself; it is rather the basic idea behind how self-attention is represented in an LLM.

The idea is to use the dot product (a method of multiplying two vectors to yield a scalar value) to measure the similarity of two tokens. The example above measures how the vector of token 2 ("journey") "attends to" the other tokens in the sequence; a higher value means the two tokens are more similar.
The softmax function is then used to normalize the weights \( w_{2i} \) we calculated above:
\[ \alpha_{2i}=softmax(w_{2i})=\frac{e^{w_{2i}}}{\sum_{j=1}^{n}e^{w_{2j}}} \]
The function above amplifies the importance of the larger scores and makes sure that all the weights sum to 1, which simplifies later computations such as the loss.
Then each normalized weight \( \alpha_{2i} \) is multiplied with its token vector, and the results are summed to form the context vector \( z^{(2)} \). So now, instead of using \( x^{(2)} \), we use \( z^{(2)} \), because it carries additional information about how \( x^{(2)} \) attends to the rest of the sequence.

In fact, we don't use this \( z^{(2)} \) as the input of any layer; as mentioned earlier, this simplified process is only for understanding the concept. Still, this \( z^{(2)} \) value is essentially what we want the LLM to learn to produce from the data.
In the example above, we calculate the context vector from T input tokens; this T is also the context length of the LLM. For example, GPT-2 has a context length of 1,024, which means the context vectors are calculated from up to 1,024 input tokens and that information is used to predict the next token. If the input is shorter than the context length, it is padded with a special token, and the padding is ignored in self-attention.
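Here is a minimal sketch of this simplified (non-trainable) self-attention for \( x^{(2)} \), using the same toy 3-dimensional input vectors that appear in the runnable code further down:

import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     (x^1)
     [0.55, 0.87, 0.66],  # journey  (x^2)
     [0.57, 0.85, 0.64],  # starts   (x^3)
     [0.22, 0.58, 0.33],  # with     (x^4)
     [0.77, 0.25, 0.10],  # one      (x^5)
     [0.05, 0.80, 0.55]]  # step     (x^6)
)

query = inputs[1]                                 # x^(2), the "journey" token
attn_scores = inputs @ query                      # dot product of x^(2) with every token -> w_2i
attn_weights = torch.softmax(attn_scores, dim=0)  # normalize so the weights sum to 1 -> alpha_2i
context_vec = attn_weights @ inputs               # weighted sum of all token vectors -> z^(2)

print(attn_weights, attn_weights.sum())           # the weights sum to 1
print(context_vec)                                # the context vector z^(2)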
Self-attention with trainable weights
In the LLM architecture, instead of calculating the fixed value of \( \alpha_{2i} \) as above, we use trainable weight matrices \( W_{q} \) (like a search query), \( W_{k} \) (like the key used for indexing), and \( W_{v} \) (like the value of the data). These matrices can be thought of as placeholders that hold the learned parameters of the LLM; they are initialized with controlled random values and are adjusted by backward propagation during the learning process.
Okay, so to be clear about these matrices, we need to understand the learning process in general: whenever data is fed into the LLM for training, it (after being tokenized, vectorized, and normalized) goes through computations similar to the self-attention concept explained earlier, repeated several times, so that the final output determines the predicted word. We compare that output with the expected word and propagate the difference back through the layers of the LLM, and each layer adjusts its \( W_{q} \), \( W_{k} \), \( W_{v} \) so that the model gets better at predicting the next word. This process is repeated an enormous number of times, and the final set of values of \( W_{q} \), \( W_{k} \), \( W_{v} \) becomes the model parameters we use for inference.

Runnable code
import torch
import torch.nn as nn


class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec


if __name__ == "__main__":
    inputs = torch.tensor(
        [[0.43, 0.15, 0.89],  # Your     (x^1)
         [0.55, 0.87, 0.66],  # journey  (x^2)
         [0.57, 0.85, 0.64],  # starts   (x^3)
         [0.22, 0.58, 0.33],  # with     (x^4)
         [0.77, 0.25, 0.10],  # one      (x^5)
         [0.05, 0.80, 0.55]]  # step     (x^6)
    )

    d_in = inputs.shape[1]
    d_out = 2
    print(f'{d_in}, {d_out}')

    torch.manual_seed(789)
    sa_v2 = SelfAttention_v2(d_in, d_out)
    print(sa_v2(inputs))
self_attention.py
Causal attention
We now understand how the context vector is calculated in an LLM. In reality, however, we don't want to feed the LLM all the data at once; we want it to predict the next token based only on the tokens fed in so far (an incomplete sequence).

In order to do that, we need to mask out future tokens before feeding the sequence to the LLM.

Remember, when we mask the attention scores above the diagonal, the rows of the matrix are no longer normalized (the total of a row no longer equals 1), so a normalization step is needed after the masking.
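Here is a minimal sketch of that masking idea (the attention scores are random made-up values); as in the runnable code below, the scores above the diagonal are set to -inf before the softmax, which performs the masking and the renormalization in one step:

import torch

context_length = 4
attn_scores = torch.rand(context_length, context_length)  # made-up attention scores

# Mask everything above the diagonal (future tokens) with -inf, then apply softmax:
# the -inf entries become 0 and every row sums to 1 again.
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1).bool()
masked_scores = attn_scores.masked_fill(mask, -torch.inf)
attn_weights = torch.softmax(masked_scores, dim=-1)

print(attn_weights)              # the upper triangle is all zeros
print(attn_weights.sum(dim=-1))  # every row sums to 1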

Runnable code
import torch
import torch.nn as nn


class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)  # New
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))  # New

    def forward(self, x):
        b, num_tokens, d_in = x.shape  # New batch dimension b
        # For inputs where `num_tokens` exceeds `context_length`, this will result in errors
        # in the mask creation further below.
        # In practice, this is not a problem since the LLM (chapters 4-7) ensures that inputs
        # do not exceed `context_length` before reaching this forward method.
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)  # Changed transpose
        attn_scores.masked_fill_(  # New, _ ops are in-place
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)  # `:num_tokens` to account for cases where the number of tokens in the batch is smaller than the supported context_size
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)  # New

        context_vec = attn_weights @ values
        return context_vec


if __name__ == "__main__":
    inputs = torch.tensor(
        [[0.43, 0.15, 0.89],  # Your     (x^1)
         [0.55, 0.87, 0.66],  # journey  (x^2)
         [0.57, 0.85, 0.64],  # starts   (x^3)
         [0.22, 0.58, 0.33],  # with     (x^4)
         [0.77, 0.25, 0.10],  # one      (x^5)
         [0.05, 0.80, 0.55]]  # step     (x^6)
    )

    d_in = inputs.shape[1]
    d_out = 2

    batch = torch.stack((inputs, inputs), dim=0)
    print(batch.shape)

    torch.manual_seed(123)
    context_length = batch.shape[1]
    ca = CausalAttention(d_in, d_out, context_length, 0.0)

    context_vecs = ca(batch)
    print(context_vecs)
    print("context_vecs.shape:", context_vecs.shape)
casual_attention.py
Dropout
Dropout in deep learning is a technique where randomly selected hidden layer units are ignored during training, effectively “dropping” them out. This method helps prevent overfitting by ensuring that a model does not become overly reliant on any specific set of hidden layer units.
It’s important to emphasize that dropout is only used during training and is disabled afterward.
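A minimal sketch of that behavior with PyTorch's nn.Dropout (the input tensor is just a toy example):

import torch
import torch.nn as nn

torch.manual_seed(123)
dropout = nn.Dropout(0.5)    # drop 50% of the units during training
example = torch.ones(6)

dropout.train()              # training mode: units are randomly zeroed,
print(dropout(example))      # and the survivors are scaled by 1 / (1 - 0.5) = 2

dropout.eval()               # evaluation mode: dropout is disabled
print(dropout(example))      # the input passes through unchanged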

Multi-head attention
In short, we will have many groups of \( W_{q} \), \( W_{k} \), \( W_{v} \), so that the LLM can learn many different aspects of the data.

Basically, \( W_{q2} \), \( W_{k2} \), \( W_{v2} \) are no different from \( W_{q1} \), \( W_{k1} \), \( W_{v1} \) in structure, but they are initialized with different values and receive different adjustments during backward propagation. Because of this independence, each group is considered a separate head that learns different things from the same data.
Runnable code
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        # As in `CausalAttention`, for inputs where `num_tokens` exceeds `context_length`,
        # this will result in errors in the mask creation further below.
        # In practice, this is not a problem since the LLM (chapters 4-7) ensures that inputs
        # do not exceed `context_length` before reaching this forward method.

        keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # optional projection

        return context_vec


if __name__ == "__main__":
    inputs = torch.tensor(
        [[0.43, 0.15, 0.89],  # Your     (x^1)
         [0.55, 0.87, 0.66],  # journey  (x^2)
         [0.57, 0.85, 0.64],  # starts   (x^3)
         [0.22, 0.58, 0.33],  # with     (x^4)
         [0.77, 0.25, 0.10],  # one      (x^5)
         [0.05, 0.80, 0.55]]  # step     (x^6)
    )

    d_in = inputs.shape[1]
    d_out = 2

    batch = torch.stack((inputs, inputs), dim=0)
    print(batch.shape)

    torch.manual_seed(123)
    batch_size, context_length, d_in = batch.shape
    d_out = 2
    mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)

    context_vecs = mha(batch)
    print(context_vecs)
    print("context_vecs.shape:", context_vecs.shape)
multi_head_attention.py
LLM Architecture


Typically, an LLM has many Transformer blocks (12 in the image above); the output of one Transformer block becomes the input of the next one in the sequence.
Layer Normalization

Multi-head attention is basically a chain of matrix operations producing the context values we discussed earlier, and after many layers those operations can yield values that are either extremely close to zero or extremely large. These problems are known as vanishing and exploding gradients.
This layer adjusts the values coming out of a layer so that their mean equals 0
\[ \mu=\frac{\sum_{i=1} ^{T}x_{i}}{T}=0 \]
and their variance equals 1.
\[ \sigma ^{2} = \frac{\sum_{i=1} ^{T}(x_{i}-\mu) ^2}{T}=\frac{\sum_{i=1} ^{T}(x_{i}) ^{2}}{T}=1 \]
Vanishing gradients happen when the values get too close to zero (too many leading zeros after the decimal point). The model effectively stops learning, because whatever data is fed in, the resulting changes are insignificant.
Exploding gradients happen when the values grow too large: once they exceed the capacity of the floating-point type they become inf, subsequent math operations on inf produce NaN, and training breaks down.
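A tiny sketch of that overflow failure mode (using a float32 tensor purely for illustration):

import torch

big = torch.tensor(3.0e38)   # close to the float32 limit (~3.4e38)
print(big * 10)              # overflows to inf
print(big * 10 - big * 10)   # inf - inf is undefined -> nan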
Runnable code
import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift


if __name__ == "__main__":
    # create 2 training examples with 5 dimensions (features) each
    batch_example = torch.randn(2, 5)

    ln = LayerNorm(emb_dim=5)
    out_ln = ln(batch_example)
    mean = out_ln.mean(dim=-1, keepdim=True)
    var = out_ln.var(dim=-1, unbiased=False, keepdim=True)
    print("Mean:\n", mean)
    print("Variance:\n", var)
layer_norm.py
ReLU activation
For each layer of the LLM, the outputs are calculated by matrix operations, which we can write as \( y_{i+1}=W_{i+1} \cdot y_{i} \), where \( W_{i} \) are the parameter matrices we discussed earlier, and hence:
\[ y_{i+2}=W_{i+2} \cdot y_{i+1}=W_{i+2} \cdot (W_{i+1} \cdot y_{i})=(W_{i+2} \cdot W_{i+1}) \cdot y_{i}=W^{*} \cdot y_{i} \]
That means we could replace both \( W_{i+2} \) and \( W_{i+1} \) with a single new matrix \( W^{*} \), which also means the LLM cannot learn anything new from the data no matter how many layers we add, as long as those layers stay purely linear.
The ReLU function simply turns all negative values into 0, so the chain of layers is no longer linear, which creates the opportunity for the LLM to learn new relationships in the input.
In practice, GELU or SwiGLU is used instead, giving a smoother activation that allows some negative inputs to contribute to the learning process.
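Below is a minimal sketch of the linearity argument above (the layer sizes are arbitrary): two stacked linear layers without an activation are exactly equivalent to one combined linear layer, while inserting a ReLU between them breaks that equivalence.

import torch
import torch.nn as nn

torch.manual_seed(123)
x = torch.randn(4, 8)

lin1 = nn.Linear(8, 16, bias=False)   # W_{i+1}
lin2 = nn.Linear(16, 8, bias=False)   # W_{i+2}

# Stacking the two linear layers...
y_stacked = lin2(lin1(x))

# ...is the same as one linear layer whose weight is W_{i+2} @ W_{i+1} (= W^*)
w_star = lin2.weight @ lin1.weight
y_combined = x @ w_star.T
print(torch.allclose(y_stacked, y_combined, atol=1e-6))    # True

# With a ReLU in between, no single matrix reproduces the mapping in general
y_nonlinear = lin2(torch.relu(lin1(x)))
print(torch.allclose(y_stacked, y_nonlinear, atol=1e-6))   # False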

Runnable code
import torch
import torch.nn as nn


class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # tanh-based approximation of the GELU activation
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))
gelu.py
FeedForward module

This module creates an intermediate layer with a significantly higher dimensionality than the input and output (which usually have the same size), giving the model extra capacity to generalize over the data.
Runnable code
import torch
import torch.nn as nn
from config import GPT_CONFIG_124M
from gelu import GELU


class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)


if __name__ == "__main__":
    ffn = FeedForward(GPT_CONFIG_124M)

    # input shape: [batch_size, num_token, emb_size]
    x = torch.rand(2, 3, 768)
    out = ffn(x)
    print(out.shape)
feed_forward.py
Shortcut connection

We already know about the vanishing gradient, and it can occur in both directions, forward and backward propagation, meaning the values become too small to make significant changes. The shortcut connection therefore adds the input back to the output of the multi-head attention (or feed forward) computation so that the output stays significant.
Don't be fooled by the name "shortcut connection": it does not bypass the network layer; it adds the input to the layer's output to produce more significant output values.
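A minimal sketch of a shortcut connection around an arbitrary layer (the Linear layer here is just a stand-in for the real attention or feed forward block):

import torch
import torch.nn as nn

torch.manual_seed(123)
layer = nn.Linear(4, 4)
x = torch.randn(2, 4)

out_without_shortcut = layer(x)    # the layer output on its own may be very small
out_with_shortcut = x + layer(x)   # shortcut: add the original input back to the output

print(out_without_shortcut.abs().mean())
print(out_with_shortcut.abs().mean())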
Transformer block
The TransformerBlock class includes a multi-head attention mechanism (MultiHeadAttention) and a feed forward network (FeedForward), both configured based on a provided configuration dictionary (cfg), such as GPT_CONFIG_124M. Layer normalization (LayerNorm) is applied before each of these two components, and dropout is applied after them to regularize the model and prevent overfitting.
Runnable code
# config.py (imported below via `from config import GPT_CONFIG_124M`)
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}


import torch
import torch.nn as nn
from config import GPT_CONFIG_124M
from feed_forward import FeedForward
from layer_norm import LayerNorm
from multi_head_attention import MultiHeadAttention


class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)   # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        return x


if __name__ == "__main__":
    torch.manual_seed(123)
    x = torch.rand(2, 4, 768)  # Shape: [batch_size, num_tokens, emb_dim]
    block = TransformerBlock(GPT_CONFIG_124M)
    output = block(x)

    print("Input shape:", x.shape)
    print("Output shape:", output.shape)
transformer.py
GPT code
import torch
import torch.nn as nn
import tiktoken
from config import GPT_CONFIG_124M
from layer_norm import LayerNorm
from transformer import TransformerBlock


class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        # Positional embeddings for positions 0..seq_len-1, on the same device as the input
        pos_embeds = self.pos_emb(
            torch.arange(seq_len, device=in_idx.device)
        )
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits


def generate_text_simple(model, idx,  # idx is a (batch, n_tokens) array of token indices in the current context
                         max_new_tokens, context_size):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]  # Crop the context to the supported context size
        with torch.no_grad():
            logits = model(idx_cond)

        logits = logits[:, -1, :]  # Focus only on the last time step
        probas = torch.softmax(logits, dim=-1)  # Convert logits to probabilities
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # Pick the most probable next token
        idx = torch.cat((idx, idx_next), dim=1)  # Append it to the running sequence

    return idx


if __name__ == "__main__":
    tokenizer = tiktoken.get_encoding("gpt2")

    batch = []
    txt1 = "Every effort moves you"
    txt2 = "Every day holds a"

    batch.append(torch.tensor(tokenizer.encode(txt1)))
    batch.append(torch.tensor(tokenizer.encode(txt2)))
    batch = torch.stack(batch, dim=0)
    print(batch)

    torch.manual_seed(123)
    model = GPTModel(GPT_CONFIG_124M)

    out = model(batch)
    print("Input batch:\n", batch)
    print("\nOutput shape:", out.shape)
    print(out)

    total_params = sum(p.numel() for p in model.parameters())
    print(f"Total number of parameters: {total_params:,}")

    print("Token embedding layer shape:", model.tok_emb.weight.shape)
    print("Output layer shape:", model.out_head.weight.shape)

    total_params_gpt2 = total_params - sum(p.numel() for p in model.out_head.parameters())
    print(f"Number of trainable parameters considering weight tying: {total_params_gpt2:,}")

    # Calculate the total size in bytes (assuming float32, 4 bytes per parameter)
    total_size_bytes = total_params * 4

    # Convert to megabytes
    total_size_mb = total_size_bytes / (1024 * 1024)
    print(f"Total size of the model: {total_size_mb:.2f} MB")

    start_context = "Hello, I am"
    encoded = tokenizer.encode(start_context)
    print("encoded:", encoded)

    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # Add a batch dimension
    print("encoded_tensor.shape:", encoded_tensor.shape)

    model.eval()  # Disable dropout for inference
    out = generate_text_simple(
        model=model,
        idx=encoded_tensor,
        max_new_tokens=6,
        context_size=GPT_CONFIG_124M["context_length"]
    )
    print("Output:", out)
    print("Output length:", len(out[0]))

    decoded_text = tokenizer.decode(out.squeeze(0).tolist())
    print(decoded_text)
gpt.py