
Building a Large Language Model from scratch: A learning journey

Build a Large Language Model from Scratch book cover

What is this? I want to understand AI models not only as a user. I want to know how they're built, what's inside, and how they make the magic possible. I don't want to just consume; I want to know the ingredients and I want to be able to cook.

I will be going through the Build a Large Language Model (From Scratch) book by Sebastian Raschka. My intention is to go through it daily: 2 hours per day on weekdays and 1 hour per day on weekends, until I finish the book. Then I'll proceed to more hands-on projects to put this knowledge to use.

Why am I doing this? I was making notes for myself: to understand better, to see progress, to keep myself accountable. Then I came across this old tweet:

How to become an expert at thing

This page will be updated as I go, until I finish the book, and this is the summary in my own words.

Who is the intended audience? It probably has an audience of one - just me. It's not a learning resource, as I'm only adding things that I want to remember or understand better. If something is obvious to me, I won't be adding it here. But maybe it could be useful for others too, if not for learning, then at least to see the process of someone learning from scratch and to get inspired to do the same.

Daily progress

Day 1 (Thu)

  • Created this page.
  • Started reading and learning.
  • First step back before moving forward. I don't know PyTorch so I need to start with PyTorch basics in Appendix A before even starting Chapter 1.

Reading Appendix A, "Introduction to PyTorch"

Installed PyTorch, uv, pip, jupyter and ran a sample notebook in Jupyter. Everything works.

Finished the day at page 261 of the book (a high page number because it's an appendix).

Day 2 (Fri)

Continuing learning PyTorch.

p.279

Day 3 (Sat)

Finished PyTorch section.

PyTorch section summary

  • PyTorch is like NumPy, but for tensors instead of arrays, plus automatic differentiation and deep learning capabilities.
  • Learned what tensors are: multi-dimensional arrays.
  • Works well on GPUs, too bad my Linux machine doesn't have an NVIDIA GPU and I have to run everything on an Intel CPU.
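
To make that concrete, here's a minimal sketch (my own, not from the book) of the two things that stood out to me: tensors work like NumPy arrays, and PyTorch can differentiate through operations automatically.

```python
import torch

# A tensor that tracks gradients: like a NumPy array, but differentiable
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()      # y = 1 + 4 + 9 = 14
y.backward()            # automatic differentiation
print(x.grad)           # dy/dx = 2x -> tensor([2., 4., 6.])

# Tensors can live on a GPU if one is available; on my machine it's CPU only
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
```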

Finished reading Appendix A, "Introduction to PyTorch".

Started Chapter 1: Understanding large language models

p.5

Day 4 (Sun)

LLMs are based on the transformer architecture. The key idea of the transformer architecture is an attention mechanism that gives the LLM selective access to the whole input sequence when generating the output one word at a time. Like reading a paragraph and instinctively focusing on the few words that tell you what it's really about.

The ability to perform tasks that the model wasn't explicitly trained to perform is called an emergent behavior. Translation was such a behavior for GPT models: the model wasn't trained for it but was able to do it, based on vast data in different contexts. I wonder how many more emergent behaviors are still waiting to be uncovered and will surprise us in the future.

Finished Chapter 1: Understanding large language models.

Started Chapter 2: Working with text data.

Embedding - the concept of converting data into a vector format.

An embedding is a mapping from discrete objects, such as words, images, or even entire documents, to points in a continuous vector space. The primary purpose of embeddings is to convert nonnumeric data into a format that neural networks can process.

We can embed different data types: video, audio, text. But different data formats require different embedding models.
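
A tiny sketch of the idea, with made-up sizes: an embedding layer is essentially a lookup table from discrete IDs to continuous vectors.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Toy example: a vocabulary of 6 "words", each mapped to a 3-dimensional vector
embedding = nn.Embedding(num_embeddings=6, embedding_dim=3)

token_ids = torch.tensor([2, 5, 1])   # three discrete token IDs
vectors = embedding(token_ids)        # lookup -> continuous vectors
print(vectors.shape)                  # torch.Size([3, 3])
```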

p. 22

Day 5 (Mon)

Implementing a "toy" tokenizer to understand how tokenization works and its main concepts.

Byte-pair encoding (BPE) - when a word is not in the model's predefined vocabulary, BPE breaks it down into smaller subword units or even individual characters, enabling the model to handle out-of-vocabulary words.
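
The book uses OpenAI's tiktoken library for the GPT-2 BPE tokenizer. A quick sketch of what happens with a word that isn't in the vocabulary (the gibberish string is just an example):

```python
import tiktoken  # OpenAI's BPE tokenizer library

enc = tiktoken.get_encoding("gpt2")

# A nonsense string that's certainly not in the vocabulary as a whole word
ids = enc.encode("Akwirw ier")
print(ids)                              # a handful of subword token IDs
print([enc.decode([i]) for i in ids])   # how BPE split it into known pieces
print(enc.decode(ids))                  # decodes back to the original string
```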

p. 41

Day 6 (Tue)

Chapter 2: Working with text data

Embedding size = number of features per token. In general, more is better, but it has to be balanced against cost, slower training, and wasted capacity.

In addition to token embeddings, positional embeddings are added for each token:

  • The token embeddings (what each word means)
  • The positional embeddings (how far along the sentence we are)

Full embedding process:

  • Get input text
  • Break it down into tokens
  • Convert all tokens to token IDs
  • Generate token embeddings of X size (X = number of dimensions / features per token)
  • Generate positional embeddings of X size (same as token embeddings)
  • Final input embeddings = token embeddings + positional embeddings
    • Same vector size but enriched with positional data which will be important during training
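
A rough sketch of the steps above in PyTorch, using GPT-2-style sizes (the embedding layers are untrained here, so the resulting vectors are still random noise):

```python
import torch
import torch.nn as nn
import tiktoken

torch.manual_seed(123)
vocab_size, emb_dim, context_length = 50257, 768, 4   # GPT-2-style sizes

tok_emb_layer = nn.Embedding(vocab_size, emb_dim)
pos_emb_layer = nn.Embedding(context_length, emb_dim)

enc = tiktoken.get_encoding("gpt2")
token_ids = torch.tensor(enc.encode("Every effort moves you"))   # 4 token IDs

tok_embeddings = tok_emb_layer(token_ids)                        # [4, 768]
pos_embeddings = pos_emb_layer(torch.arange(len(token_ids)))     # [4, 768]

input_embeddings = tok_embeddings + pos_embeddings               # [4, 768]
print(input_embeddings.shape)
```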

Finished Chapter 2: Working with text data.

Started Chapter 3: Coding attention mechanisms

This is the attention part from the Attention Is All You Need paper.

What's attention?

  • The model looks at all the tokens in the input and decides which ones to focus on when processing a particular token.
  • Example: When processing the word "it" in "The cat chased the dog because it was tired", the model needs to pay more attention to "cat", not "dog" or "because."

Self-attention is a mechanism that allows each position in the input sequence to consider the relevancy of, or "attend to", all other positions in the same sequence when computing the representation of a sequence.

  • Self-attention: each token attends to all tokens in the same sequence (including itself)
    • How do the words in this sentence relate to each other?
  • Cross-attention: one sequence attends to another sequence (decoder attends to encoder outputs)
    • How does the generated word relate to the input sentence?

Context vectors are calculated for each input element to enrich its representation with information from all other elements in the sequence, so the model can better understand how the words in a sentence relate to each other and how relevant they are to one another.

p.57

Day 7 (Wed)

The dot product is a way to measure how similar two vectors are.

  • Calculated by multiplying two vectors element-wise and then summing the products
  • Measures alignment - how much the two vectors point in the same direction in space
    • If they point the same way → large positive dot product
    • If they're perpendicular → dot product ≈ 0
    • If they point opposite ways → large negative dot product
  • The dot product is how the model decides which words are related (how much attention it should give to a given word)

Context vector = sum (each input vector x attention weight)
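
A minimal sketch of the whole chain with random, untrained vectors: dot products give scores, softmax turns them into weights, and the weighted sum gives context vectors.

```python
import torch

torch.manual_seed(123)
inputs = torch.rand(6, 3)                  # 6 tokens, 3-dim embeddings (random)

attn_scores = inputs @ inputs.T            # dot product of every pair: [6, 6]
attn_weights = torch.softmax(attn_scores, dim=-1)   # each row sums to 1
context_vectors = attn_weights @ inputs    # weighted sum of inputs: [6, 3]

print(context_vectors.shape)               # one context vector per token
```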

Important: at this point, all the calculated numbers (token embeddings, positional embeddings, weights) are just noise, all random. We're just setting the stage for what comes next; nothing makes sense yet. It all starts making sense during the training stage, when backpropagation starts nudging those numbers in the right directions and revealing the real context.

p.68

Day 8 (Thu)

Causal attention (also: masked attention)

  • A special form of self-attention
  • Restricts the model to considering only the previous and current inputs in a sequence when computing attention scores for a given token, unlike standard self-attention, which allows access to the entire input sequence at once.
  • Explicitly masks attention weights in the attention matrix for tokens that come after the input token.
  • Helps with learning to predict the next token effectively (essential for language models).
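
A quick sketch of the masking trick, assuming a 6-token sequence with made-up scores: everything above the diagonal (the future) is set to -inf before the softmax, so it gets zero weight.

```python
import torch

torch.manual_seed(123)
attn_scores = torch.rand(6, 6)   # pretend attention scores for 6 tokens

# True above the diagonal = positions that lie in the future
mask = torch.triu(torch.ones(6, 6, dtype=torch.bool), diagonal=1)
masked_scores = attn_scores.masked_fill(mask, float("-inf"))

attn_weights = torch.softmax(masked_scores, dim=-1)
print(attn_weights[2])   # the third token attends only to tokens 1-3
```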

p.78

Day 9 (Fri)

Dropout - randomly masking some percentage of values (e.g., attention weights or activations) to prevent overfitting. Only used during training.
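
A tiny demo of the train/inference difference (the dropout rate of 0.5 is just for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
dropout = nn.Dropout(p=0.5)   # zero out ~50% of values
x = torch.ones(2, 6)

dropout.train()
print(dropout(x))   # about half the values are 0, the rest scaled by 1/(1-p) = 2

dropout.eval()
print(dropout(x))   # at inference, dropout does nothing
```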

Multi-head attention - creating multiple instances of the self-attention mechanism, each with its own learned Q, K, and V projection weights, and then combining their outputs. Each head adds additional embedding dimensions to each token, focusing on different aspects of its meaning (during training).
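
The book builds its own MultiHeadAttention class; PyTorch also ships a built-in nn.MultiheadAttention module that captures the same idea, which I'm using here just as a sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
mha = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)

x = torch.rand(1, 4, 768)   # batch of 1, 4 tokens, 768-dim embeddings
causal_mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)

# Self-attention: query, key and value all come from the same sequence
out, weights = mha(x, x, x, attn_mask=causal_mask)
print(out.shape)            # torch.Size([1, 4, 768])
```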

p.91

Day 10 (Sat)

Started Chapter 4: Implementing a GPT model from scratch to generate text

p.98

Day 11 (Sun)

Training neural networks with many layers can be difficult due to vanishing or exploding gradients, which make it hard for the network to adjust its weights. As a result, the network struggles to learn the underlying patterns in the data and to make accurate predictions.

p.103

Day 12 (Mon)

Feed-forward module

  • A small two-layer neural network applied to each token individually
  • Lets each token transform its meaning after seeing context (i.e., after applying attention)

It's a little frustrating to keep building a more and more complex model of matrices, added dimensions, weights, vectors, mathematical operations, and trying to see where and how the model (essentially, the LLM) actually "learns", only to repeatedly remind myself that all the "learning" will happen later, during backpropagation (short for "backward propagation of errors"), and everything before that is just the machinery that makes learning possible.

Maybe going backwards would be more logical and easier to understand? But I don't know how easy it would be to explain the concepts then and if it even makes sense. Anyway, back to building yet more complicated mental models that are still nothing more than scaffolding, doing absolutely nothing, just more and more evenly distributed randomness.

Feed Forward Network: for each token, take the token's vector, blow it up into a richer space (usually 4x), apply a nonlinear transformation (GELU), then compress it back down (by 4x, so that the output is the same size as the input) into a refined meaning.
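
A sketch of that block, close to what the book builds (GPT-2 Small sizes assumed):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Expand 4x, apply GELU, compress back to the original size."""
    def __init__(self, emb_dim=768):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),   # 768 -> 3072
            nn.GELU(),                         # smooth non-linearity
            nn.Linear(4 * emb_dim, emb_dim),   # 3072 -> 768
        )

    def forward(self, x):
        return self.layers(x)

ffn = FeedForward()
x = torch.rand(1, 4, 768)    # applied to each token independently
print(ffn(x).shape)          # torch.Size([1, 4, 768]), same size as the input
```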

Vanishing gradient problem - an issue where gradients (which guide weight updates during training) become progressively smaller as they propagate backwards through the layers, making it difficult for earlier layers to learn effectively.

p.111

Day 13 (Tue)

Shortcut connections - create an alternative, shorter path for the gradient to flow through the network by skipping one or more layers. They help preserve the flow of gradients during the backward pass in training. It's achieved by adding each layer's input to the nonlinear transformation of that input. This way the gradient doesn't become vanishingly small with each layer.
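
A toy sketch of the idea: the output is the input plus a transformation of it, so the gradient always has a direct path back.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """output = input + F(input): the shortcut keeps a direct gradient path."""
    def __init__(self, dim=768):
        super().__init__()
        self.layer = nn.Sequential(nn.Linear(dim, dim), nn.GELU())

    def forward(self, x):
        return x + self.layer(x)   # the shortcut (residual) connection

block = ResidualBlock()
x = torch.rand(1, 4, 768)
print(block(x).shape)   # same shape as the input
```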

p.117

Day 14 (Wed)

Overview of GPT model architecture

Flow of data through the GPT model (source: Build a Large Language Model (From Scratch) by Sebastian Raschka)

My understanding of the GPT model so far: what's coming in, how data flows through it, what's coming out.

  • Start at the bottom. Input is text "Every effort moves you". The task of the model is to give the next word in this sequence.
  • Text gets tokenized so that it's an array of numbers (tokens) instead of sentences, words or letters. Tokenization breaks text into subword units using a fixed vocabulary.
  • Tokenized text is then converted into embeddings. Each token ID is looked up in an embedding matrix that turns it into a dense vector (e.g., 768 or 4096 numbers). These numbers encode a rich representation and understanding of what it is. Similar tokens end up with similar vectors.
  • Then another layer of embeddings is added to encode the position of each token. The transformer has no inherent sense of order; positional embeddings add information about where each token sits in the sequence.
  • Then some small % of activations are set to 0 during training (dropout) to prevent overfitting by forcing neurons to not rely too heavily on each other and to make learning more effective. This forces the network to rely on multiple pathways rather than memorizing patterns. "At each training step, I'll randomly deactivate some neurons so others learn to contribute". Dropout is used only during training, during inference it's disabled and the model uses all neurons normally.
  • The combined information for this token is passed through a transformer block, multiple times, to learn a little with every pass and to get closer to the desired solution. In the case of the GPT-2 Small model with 124 million parameters, it passes through the transformer block 12 times sequentially (for GPT-2 XL it's 48), with each pass's output as input for the next one. Multiple passes deepen the model's ability to represent abstract relationships in text.
  • For each cycle inside the transformer:
    • Layer normalization. Normalizes activations for each token's vector so that features have consistent scale, improving training stability and gradient flow (loosely, gradient flow = the flow of learning towards the truth)
    • Masked multi-head attention. Each token produces Query (Q), Key (K) and Value (V) vectors, attends to previous tokens (attention = looking at all the tokens in the input and deciding which ones to focus on when processing a particular token) and aggregates context across multiple attention heads, each focusing on different aspects of relationships. Masked = looking only at previous tokens and masking the future ones, so that predictions remain causal.
    • Dropout again. Randomly zeroing out a portion of activations during training to reduce overfitting.
    • Shortcut (residual) connections. What happens: adding the layer's input to its output (x = x + F(x)). What it means: it improves the gradient flow (the ability to learn) through the layers, so that very deep stacks remain trainable.
    • Layer normalization again. Re-normalizes after the attention block to keep activations stable before feeding them into the next component.
    • Feed forward network. A small two-layer neural network applied independently to each token's representation.
      • Expand dimensionality (e.g., 4x: 768 × 4 = 3072). The goal is to add more capacity to the model, resulting in more meaning and nuance.
      • GELU activation. Applies smooth non-linearity to create complex transformation of features. Unlike dropout, it's used both during training and inference.
      • Compress back down by 4x, to get to the same size as input, but with added learning based on the gathered context.
    • Dropout again
    • Shortcut (residual) connection, integrating the refined version.
  • Final layer normalization. After all transformer blocks, a last normalization step to ensure consistent scaling of features.
  • Last output layer. Maps each token's final hidden vector to a vector the size of the tokenizer vocabulary. Each position now has a score for each possible next token. The scores are converted into probabilities and the token with the highest probability is selected, completing the sequence: "Every effort moves you forward".
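
To check my understanding, here's a rough skeleton of the flow above in PyTorch. It's my own sketch, not the book's GPTModel code: I'm borrowing the built-in nn.MultiheadAttention instead of the hand-built attention, and the names are made up.

```python
import torch
import torch.nn as nn

class TransformerBlockSketch(nn.Module):
    """norm -> masked multi-head attention -> dropout -> residual,
    then norm -> feed-forward -> dropout -> residual."""
    def __init__(self, emb_dim=768, n_heads=12, drop=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(emb_dim)
        self.ffn = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim), nn.GELU(), nn.Linear(4 * emb_dim, emb_dim))
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        seq_len = x.shape[1]
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)   # causal attention
        x = x + self.drop(attn_out)                        # shortcut connection
        x = x + self.drop(self.ffn(self.norm2(x)))         # shortcut connection
        return x

class GPTSketch(nn.Module):
    def __init__(self, vocab_size=50257, emb_dim=768, context_len=1024, n_layers=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(context_len, emb_dim)
        self.drop = nn.Dropout(0.1)
        self.blocks = nn.ModuleList(
            [TransformerBlockSketch(emb_dim) for _ in range(n_layers)])
        self.final_norm = nn.LayerNorm(emb_dim)
        self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)

    def forward(self, token_ids):                    # [batch, seq_len]
        positions = torch.arange(token_ids.shape[1], device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)
        x = self.drop(x)
        for block in self.blocks:                    # 12 sequential passes
            x = block(x)
        x = self.final_norm(x)
        return self.out_head(x)                      # logits over the vocabulary

logits = GPTSketch()(torch.randint(0, 50257, (1, 4)))
print(logits.shape)                                  # torch.Size([1, 4, 50257])
```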

Scale and size of a simple GPT model

I want to see where the action happens and what percentage of the model's "horse power" is dedicated to what tasks.

Even though attention gets all the fame, most of the parameters aren't in attention at all - they are in the feed-forward networks and the embedding layers.

Parameter distribution in GPT-2 Small

Out of 124M trainable parameters in GPT-2, how are they divided into different tasks?

GPT-2 Small (124M)

  • Total number of parameters: 163,009,536

  • Number of trainable parameters considering weight tying: 124,412,160

  • Token embeddings, essentially a huge vocab table (50,257 tokens × 768 embedding dimensions): 38,597,376 (31%)

  • Total number of parameters in the feed forward modules: 4,722,432 x 12 blocks ≈ 56.7M (45%)

    • Each block has two big linear layers:
      • Up-projection: 768 × 3072 = 2.36 M
      • Down-projection: 3072 × 768 = 2.36 M
      • Total per block: ≈4.72 M
    • Multiply by 12 layers → ~56 M parameters.
    • That's nearly half the whole model!
  • Total number of parameters in attention module: 2,360,064 x 12 blocks = 28M (23%)

    • Q, K, V, O weights per layer
    • Each attention block:
      • W_Q weight matrix: 768 x 768 = 589,824 parameters
      • W_K weight matrix: 768 x 768 = 589,824 parameters
      • W_V weight matrix: 768 x 768 = 589,824 parameters
      • W_O weight matrix: 768 x 768 = 589,824 parameters
      • Total: 2.36M
      • Attention is mostly matrix multiplications on the fly, not heavy in stored weights.
  • Positional embeddings, LayerNorms, biases: ~1.4M (~1%)

  • Total size of the model (total number of parameters x 4 bytes (float 32) per parameter): 621.83 MB

  • GPT-2XL for comparison (768 → 1,600 embeddings, 12 → 48 layers, 12 → 25 attention heads):

    • Total number of parameters: 1,637,792,000
    • Number of trainable parameters: 1,557,380,800
    • Total size of the model: 6,247.68 MB

Attention = compute-heavy, few weights
Feed-forward = weight-heavy, cheap to run
Embeddings = giant dictionary
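
A quick back-of-the-envelope check of the numbers above in plain Python (weight matrices only; biases and LayerNorms are left out, which is why it lands slightly under the exact totals):

```python
vocab_size, emb_dim, context_len, n_layers = 50257, 768, 1024, 12

tok_emb  = vocab_size * emb_dim                     # 38,597,376
pos_emb  = context_len * emb_dim                    # 786,432
attn     = n_layers * 4 * emb_dim * emb_dim         # W_Q, W_K, W_V, W_O -> ~28.3M
ffn      = n_layers * 2 * emb_dim * (4 * emb_dim)   # up + down projections -> ~56.6M
out_head = vocab_size * emb_dim                     # output layer, tied with tok_emb

total = tok_emb + pos_emb + attn + ffn + out_head
print(f"total: {total:,}")                           # ~162.9M, close to 163,009,536
print(f"with weight tying: {total - out_head:,}")    # ~124.3M, close to 124,412,160
```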

p. 123
