# CMSC 25700/35100: Natural Language Processing
## Lecture 8: Pretraining and Fine-tuning
Chenhao Tan
University of Chicago
@ChenhaoTan, @chenhaotan.bsky.social
chenhao@uchicago.edu
Submit music recommendations (kudos to Mina Lee)
Submit jokes
## Logistics
* Project
  - team formation and proposal (due on Feb 8)
  - weekly votes
* Exam preparation
  - What do you think will be most helpful?
* Resources and posts on Ed
## Recall: Causal Self-Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d\_k}} + M\right)V$$

where $M\_{ij} = \left\\{ \begin{array}{ll} 0 & j \leq i \\ -\infty & j > i \end{array} \right.$

- Each position only attends to itself and previous positions
- Enables autoregressive language modeling
- Parallel training, sequential generation
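The mask above can be applied directly to the score matrix before the softmax. A minimal NumPy sketch (the function name and random test tensors are illustrative, not from the slides):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.
    Q, K, V: arrays of shape (seq_len, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len)
    # M_ij = 0 if j <= i, -inf if j > i: future positions get zero weight
    i, j = np.indices(scores.shape)
    scores = np.where(j > i, -np.inf, scores)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = causal_attention(Q, K, V)
# Position 0 can only attend to itself, so its output equals V[0]
assert np.allclose(out[0], V[0])
```

Because row 0 of the masked score matrix has only one finite entry, the softmax puts all its weight on position 0, which is a quick sanity check that the mask is oriented correctly.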
## The Complete Forward Pass

```python
x = token_embed[input_ids] + position_embed  # Embed
for layer in transformer_layers:             # Each layer
    x = x + layer.attn(layer.ln_1(x))        # Attention
    x = x + layer.ffn(layer.ln_2(x))         # FFN
x = final_ln(x)                              # Final norm
logits = x @ W_U.T                           # Unembed
probs = softmax(logits)                      # Probabilities
```

`@` is matrix multiplication (broadcasts over batch dimensions)
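The pseudocode above can be made concrete with one pre-LN layer and random weights. A minimal NumPy sketch; the tiny config, the 0.1 weight scale, the ReLU FFN, and the weight-tied unembedding are all assumptions for illustration, not details from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
V_SIZE, D, SEQ = 50, 16, 6  # hypothetical tiny vocab / model dim / length

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def causal_attn(x, Wq, Wk, Wv, Wo):
    Q, K, Vv = x @ Wq, x @ Wk, x @ Wv
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    i, j = np.indices(s.shape)
    s = np.where(j > i, -np.inf, s)          # causal mask
    return softmax(s) @ Vv @ Wo

def ffn(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2        # ReLU MLP (an assumption)

# Random weights stand in for learned parameters
tok_embed = rng.normal(size=(V_SIZE, D))
pos_embed = rng.normal(size=(SEQ, D))
Wq, Wk, Wv, Wo = (0.1 * rng.normal(size=(D, D)) for _ in range(4))
W1 = 0.1 * rng.normal(size=(D, 4 * D))
W2 = 0.1 * rng.normal(size=(4 * D, D))
W_U = tok_embed                              # weight tying, a common choice

input_ids = rng.integers(0, V_SIZE, size=SEQ)
x = tok_embed[input_ids] + pos_embed         # Embed
x = x + causal_attn(layer_norm(x), Wq, Wk, Wv, Wo)  # Attention (pre-LN)
x = x + ffn(layer_norm(x), W1, W2)                  # FFN (pre-LN)
x = layer_norm(x)                            # Final norm
logits = x @ W_U.T                           # Unembed
probs = softmax(logits)                      # (SEQ, V_SIZE), rows sum to 1
```

Each row of `probs` is a distribution over the next token at that position; stacking more layers in a loop recovers the pseudocode above.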