Turn any PDF, video, webpage, or text into a complete learning package — quizzes,
flashcards, structured notes, exercises, glossary, concept map, and more.
sa auto transformer-paper.pdf
sa quiz output/transformer-learning-pack.yaml --difficulty hard
output/transformer-learning-pack.yaml — SkillPack Explorer
12-Section Study Guide — transformer-learning-pack.md
1. Summary
The Transformer model replaces recurrence with self-attention, enabling parallel computation across sequence positions. It achieves state-of-the-art results on machine translation while being significantly faster to train...
8. Feed-Forward Network — position-wise FFN with inner dimension 2048
9. Residual Connections — skip connections around each sub-layer enabling gradient flow
10. Layer Normalization — stabilizes training by normalizing across feature dimensions
11. Masked Attention — preventing decoder from attending to future positions
12. Parallelization — eliminating sequential bottleneck of RNNs, enabling GPU parallelism
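The feed-forward network in section 8 is simple enough to sketch directly. The snippet below (random weights standing in for learned parameters; names are illustrative) shows the per-position computation FFN(x) = max(0, xW1 + b1)W2 + b2 with the paper's base sizes d_model = 512 and d_ff = 2048:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x·W1 + b1)·W2 + b2, applied identically at every position."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048                       # base-model sizes from the paper
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.02, (d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(0, 0.02, (d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(10, d_model))              # 10 positions in a sequence
out = position_wise_ffn(x, W1, b1, W2, b2)
print(out.shape)                                # shape is preserved per position
```

Because the same W1/W2 are applied at every position, the FFN never mixes information across the sequence — only attention does that.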
Glossary — 18 terms with cross-references
Term | Definition | Related
Attention | A mechanism that computes a weighted sum of values based on query-key compatibility scores | Self-Attention, Cross-Attention
Self-Attention | Attention where queries, keys, and values all come from the same sequence | Multi-Head Attention
Multi-Head | Running h parallel attention functions with different learned projections | Attention, d_k
Positional Encoding | Sinusoidal or learned vectors added to embeddings to encode sequence position | Embedding
Layer Norm | Normalization technique that normalizes across feature dimensions per token | Batch Norm, Residual
Residual Connection | Skip connection that adds a sub-layer's input to its output: x + Sublayer(x) | Layer Norm
BLEU | Bilingual Evaluation Understudy — precision-based metric for machine translation | Translation, Evaluation
Byte-Pair Encoding | Subword tokenization that iteratively merges frequent character pairs | Tokenization, Vocabulary
+ 10 more terms in the full skill pack...
Flashcards — click to reveal answers
Why does scaled dot-product attention divide by √d_k?
Large dot products push softmax into regions with extremely small gradients. If query and key components are independent with mean 0 and variance 1, their dot product has variance d_k; dividing by √d_k rescales it to unit variance regardless of dimensionality, keeping gradients stable during training.
attention · math
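The answer above translates almost line-for-line into NumPy. A minimal sketch of softmax(QKᵀ/√d_k)V, with illustrative shapes and names:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # scale keeps score variance ~1
    scores -= scores.max(axis=-1, keepdims=True)      # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)   # one output vector per query position
```

Each row of `w` is a probability distribution over the key positions, so every output row is a convex combination of the rows of V.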
What is the purpose of multi-head attention vs single-head?
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single head, averaging inhibits this. With h=8 heads and d_k=64 each, cost is similar to a single head with full d_model=512.
model · multi-head
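The cost claim on this card can be checked with a back-of-envelope parameter count, assuming separate d_model×d_k projections for Q, K, and V per head plus one d_model×d_model output projection:

```python
d_model, h = 512, 8
d_k = d_model // h  # 64 dimensions per head

# Q, K, V projections for each head, plus the final output projection.
multi_head_params = h * 3 * (d_model * d_k) + d_model * d_model
# One head attending over the full model width, same output projection.
single_head_params = 3 * (d_model * d_model) + d_model * d_model

print(multi_head_params == single_head_params)  # True: 1,048,576 parameters each
```

Splitting d_model across heads keeps the total projection size constant, so the extra representational flexibility comes at essentially no added parameter cost.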
How does the Transformer handle sequence order without recurrence?
Through positional encodings — sinusoidal functions of different frequencies added to the input embeddings: PE(pos,2i) = sin(pos/10000^(2i/d_model)) and PE(pos,2i+1) = cos(pos/10000^(2i/d_model)). Because PE(pos+k) is a linear function of PE(pos) for any fixed offset k, the model can learn to attend by relative position.
positional-encoding
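The formula on this card can be implemented directly. A minimal NumPy sketch (the function name is illustrative), using the sizes from the visualization exercise below:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]            # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]        # even embedding dimensions
    angle = pos / np.power(10000.0, i / d_model) # one frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # sin on even dims
    pe[:, 1::2] = np.cos(angle)                  # cos on odd dims
    return pe

pe = sinusoidal_pe(100, 64)
print(pe.shape)
```

At pos = 0 every sin dimension is 0 and every cos dimension is 1, which is a handy sanity check when verifying an implementation.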
Why does the decoder use masked self-attention?
To preserve the auto-regressive property during training. The mask sets all positions to the right of the current position to -∞ before softmax, preventing the decoder from "cheating" by seeing future tokens it should be predicting.
decoder · masking
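The mask described here is just an upper-triangular matrix of -∞ applied to the attention scores before softmax. A small NumPy sketch with illustrative scores:

```python
import numpy as np

n = 5
scores = np.random.default_rng(0).normal(size=(n, n))
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future positions
scores = np.where(mask, -np.inf, scores)

# Softmax: exp(-inf) is exactly 0, so future positions get zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))
```

Position 0 can only attend to itself, position 1 to positions 0–1, and so on — exactly the auto-regressive pattern the decoder needs at training time.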
What is the computational complexity of self-attention vs recurrence?
Self-attention: O(n²·d) per layer. Recurrence: O(n·d²) per layer. Self-attention is faster when n < d (typical for most NLP), and crucially, it has O(1) maximum path length vs O(n) for recurrence, enabling better gradient flow.
complexity · comparison
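A toy operation count makes the crossover concrete (constant factors ignored; purely illustrative):

```python
# Per-layer operation counts: n = sequence length, d = model dimension.
def self_attention_ops(n, d):
    return n * n * d   # every position attends to every position

def recurrence_ops(n, d):
    return n * d * d   # one d×d state update per position

n, d = 50, 512         # a typical sentence vs. the base model width
print(self_attention_ops(n, d) < recurrence_ops(n, d))  # True: n < d favors attention
```

With n = 50 and d = 512, self-attention needs roughly a tenth of the operations — and unlike recurrence, all n² score computations can run in parallel.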
+ 25 more cards in the full skill pack...
Practice Exercises
Implement Scaled Dot-Product Attention MEDIUM
Write a function that takes Q, K, V matrices and returns the attention output. Include the scaling factor and softmax. Test with a small example to verify correctness.
Analyze Multi-Head vs Single-Head HARD
Design an experiment comparing a single attention head (d_k=512) vs 8 heads (d_k=64 each) on a sequence labeling task. Hypothesize which will perform better and explain why. Consider computational cost and representation capacity.
Positional Encoding Visualization EASY
Plot the sinusoidal positional encoding matrix for sequence length 100 and d_model=64. Explain what patterns you observe and how they encode relative positions.
Critique: Transformer Limitations HARD
The Transformer's self-attention has O(n²) complexity. Analyze three proposed solutions (Linformer, Performer, Flash Attention) and compare their trade-offs in accuracy, speed, and memory usage.
+ 2 more exercises in the full skill pack...
SKILL.md Export — Claude Code / Cursor / Codex Compatible
sa export output/transformer.yaml --format skill