Turn any PDF, video, webpage, or text into a complete learning package — quizzes,
flashcards, structured notes, exercises, glossary, concept map, and more.
sa auto transformer-paper.pdf
sa quiz output/transformer-learning-pack.yaml --difficulty hard
output/transformer-learning-pack.yaml — SkillPack Explorer
12-Section Study Guide — transformer-learning-pack.md
1. Summary
The Transformer model replaces recurrence with self-attention, enabling parallel computation across sequence positions. It achieves state-of-the-art results on machine translation while being significantly faster to train...
8. Feed-Forward Network — position-wise FFN with inner dimension 2048
9. Residual Connections — skip connections around each sub-layer enabling gradient flow
10. Layer Normalization — stabilizes training by normalizing across feature dimensions
11. Masked Attention — preventing decoder from attending to future positions
12. Parallelization — eliminating sequential bottleneck of RNNs, enabling GPU parallelism
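The feed-forward network in section 8 is simple enough to sketch directly. The snippet below (random weights standing in for learned parameters; names are illustrative) shows the per-position computation FFN(x) = max(0, xW1 + b1)W2 + b2 with the paper's base sizes d_model = 512 and d_ff = 2048:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x·W1 + b1)·W2 + b2, applied identically at every position."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048                       # base-model sizes from the paper
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.02, (d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(0, 0.02, (d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(10, d_model))              # 10 positions in a sequence
out = position_wise_ffn(x, W1, b1, W2, b2)
print(out.shape)                                # shape is preserved per position
```

Because the same W1/W2 are applied at every position, the FFN never mixes information across the sequence — only attention does that.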
Glossary — 18 terms with cross-references
Term | Definition | Related
Attention | A mechanism that computes a weighted sum of values based on query-key compatibility scores | Self-Attention, Cross-Attention
Self-Attention | Attention where queries, keys, and values all come from the same sequence | Multi-Head Attention
Multi-Head | Running h parallel attention functions with different learned projections | Attention, d_k
Positional Encoding | Sinusoidal or learned vectors added to embeddings to encode sequence position | Embedding
Layer Norm | Normalization technique that normalizes across feature dimensions per token | Batch Norm, Residual
Residual Connection | Skip connection that adds a sub-layer's input to its output: x + Sublayer(x) | Layer Norm
BLEU | Bilingual Evaluation Understudy — precision-based metric for machine translation | Translation, Evaluation
Byte-Pair Encoding | Subword tokenization that iteratively merges frequent character pairs | Tokenization, Vocabulary
+ 10 more terms in the full skill pack...
Flashcards — click to reveal answers
Why does scaled dot-product attention divide by √d_k?
Large dot products push softmax into regions with extremely small gradients. If query and key components are independent with mean 0 and variance 1, their dot product has variance d_k; dividing by √d_k rescales it to unit variance regardless of dimensionality, keeping gradients stable during training.
attention · math
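The answer above translates almost line-for-line into NumPy. A minimal sketch of softmax(QKᵀ/√d_k)V, with illustrative shapes and names:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # scale keeps score variance ~1
    scores -= scores.max(axis=-1, keepdims=True)      # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)   # one output vector per query position
```

Each row of `w` is a probability distribution over the key positions, so every output row is a convex combination of the rows of V.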
What is the purpose of multi-head attention vs single-head?
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single head, averaging inhibits this. With h=8 heads and d_k=64 each, cost is similar to a single head with full d_model=512.
model · multi-head
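The cost claim on this card can be checked with a back-of-envelope parameter count, assuming separate d_model×d_k projections for Q, K, and V per head plus one d_model×d_model output projection:

```python
d_model, h = 512, 8
d_k = d_model // h  # 64 dimensions per head

# Q, K, V projections for each head, plus the final output projection.
multi_head_params = h * 3 * (d_model * d_k) + d_model * d_model
# One head attending over the full model width, same output projection.
single_head_params = 3 * (d_model * d_model) + d_model * d_model

print(multi_head_params == single_head_params)  # True: 1,048,576 parameters each
```

Splitting d_model across heads keeps the total projection size constant, so the extra representational flexibility comes at essentially no added parameter cost.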
How does the Transformer handle sequence order without recurrence?
Through positional encodings — sinusoidal functions of different frequencies added to the input embeddings: PE(pos,2i) = sin(pos/10000^(2i/d_model)) and PE(pos,2i+1) = cos(pos/10000^(2i/d_model)). Because PE(pos+k) is a linear function of PE(pos) for any fixed offset k, the model can learn to attend by relative position.
positional-encoding
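The formula on this card can be implemented directly. A minimal NumPy sketch (the function name is illustrative), using the sizes from the visualization exercise below:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]            # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]        # even embedding dimensions
    angle = pos / np.power(10000.0, i / d_model) # one frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # sin on even dims
    pe[:, 1::2] = np.cos(angle)                  # cos on odd dims
    return pe

pe = sinusoidal_pe(100, 64)
print(pe.shape)
```

At pos = 0 every sin dimension is 0 and every cos dimension is 1, which is a handy sanity check when verifying an implementation.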
Why does the decoder use masked self-attention?
To preserve the auto-regressive property during training. The mask sets all positions to the right of the current position to -∞ before softmax, preventing the decoder from "cheating" by seeing future tokens it should be predicting.
decoder · masking
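The mask described here is just an upper-triangular matrix of -∞ applied to the attention scores before softmax. A small NumPy sketch with illustrative scores:

```python
import numpy as np

n = 5
scores = np.random.default_rng(0).normal(size=(n, n))
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future positions
scores = np.where(mask, -np.inf, scores)

# Softmax: exp(-inf) is exactly 0, so future positions get zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))
```

Position 0 can only attend to itself, position 1 to positions 0–1, and so on — exactly the auto-regressive pattern the decoder needs at training time.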
What is the computational complexity of self-attention vs recurrence?
Self-attention: O(n²·d) per layer. Recurrence: O(n·d²) per layer. Self-attention is faster when n < d (typical for most NLP), and crucially, it has O(1) maximum path length vs O(n) for recurrence, enabling better gradient flow.
complexity · comparison
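A toy operation count makes the crossover concrete (constant factors ignored; purely illustrative):

```python
# Per-layer operation counts: n = sequence length, d = model dimension.
def self_attention_ops(n, d):
    return n * n * d   # every position attends to every position

def recurrence_ops(n, d):
    return n * d * d   # one d×d state update per position

n, d = 50, 512         # a typical sentence vs. the base model width
print(self_attention_ops(n, d) < recurrence_ops(n, d))  # True: n < d favors attention
```

With n = 50 and d = 512, self-attention needs roughly a tenth of the operations — and unlike recurrence, all n² score computations can run in parallel.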
+ 25 more cards in the full skill pack...
Practice Exercises
Implement Scaled Dot-Product Attention MEDIUM
Write a function that takes Q, K, V matrices and returns the attention output. Include the scaling factor and softmax. Test with a small example to verify correctness.
Analyze Multi-Head vs Single-Head HARD
Design an experiment comparing a single attention head (d_k=512) vs 8 heads (d_k=64 each) on a sequence labeling task. Hypothesize which will perform better and explain why. Consider computational cost and representation capacity.
Positional Encoding Visualization EASY
Plot the sinusoidal positional encoding matrix for sequence length 100 and d_model=64. Explain what patterns you observe and how they encode relative positions.
Critique: Transformer Limitations HARD
The Transformer's self-attention has O(n²) complexity. Analyze three proposed solutions (Linformer, Performer, Flash Attention) and compare their trade-offs in accuracy, speed, and memory usage.
+ 2 more exercises in the full skill pack...
SKILL.md Export — Claude Code / Cursor / Codex Compatible
sa export output/transformer.yaml --format skill