
LLM Interview Questions

??? tip "Machine Learning Concepts"

??? question "How would you describe the concept of machine learning in your own words?"
Machine learning focuses on creating systems that improve their performance on a task by learning patterns from data rather than relying on explicit programming.
??? question "Can you give a few examples of real-world areas where machine learning is particularly effective?"
Machine learning is especially valuable for solving complex problems without clear rule-based solutions, automating decision-making instead of hand-crafted logic, adapting to changing environments, and extracting insights from large datasets.
??? question "What are some typical problems addressed with unsupervised learning methods?"
Typical unsupervised learning tasks include clustering, data visualization, dimensionality reduction, and association rule mining.
??? question "Would detecting spam emails be treated as a supervised or unsupervised learning problem, and why?"
Spam filtering is an example of a supervised learning problem because the model learns from examples of emails labeled as "spam" or "not spam".
??? question "What does the term ‘out-of-core learning’ refer to in machine learning?"
Out-of-core learning enables training on datasets too large to fit in memory by processing them in smaller chunks (mini-batches) and updating the model incrementally.
??? question "How can you distinguish between model parameters and hyperparameters?"
- **Model parameters** define how the model behaves and are learned during training (e.g., weights in linear regression).
- **Hyperparameters** are external settings chosen before training, such as the learning rate or regularization strength.
??? question "What are some major difficulties or limitations commonly faced when building machine learning systems?"
Key challenges in machine learning include
- insufficient or low-quality data
- poor feature selection
- non-representative samples
- models that either underfit (too simple) or overfit (too complex)
??? question "If a model performs well on training data but poorly on unseen data, what issue is occurring, and how might you address it?"
When a model performs well on training data but poorly on unseen examples, it’s overfitting. This can be mitigated by collecting more diverse data, simplifying the model, applying regularization, or cleaning noisy data.
??? question "What is a test dataset used for, and why is it essential in evaluating a model’s performance?"
A test set provides an unbiased estimate of how well a model will perform on new, real-world data before deployment.
??? question "What role does a validation set play during the model development process?"
A validation set helps compare multiple models and tune hyperparameters, ensuring better generalization to unseen data.
??? question "What is a train-dev dataset, in what situations would you create one, and how is it applied during model evaluation?"
The train-dev set is a small portion of the training data set aside to identify mismatches between the training distribution and the validation/test distributions. You use it when you suspect that your production data may differ from your training data. The model is trained on most of the training data and evaluated on the train-dev set to detect overfitting or data mismatch before comparing results on the validation set.
??? question "Why is it problematic to adjust hyperparameters based on test set performance?"
If you tune hyperparameters using the test set, you risk overfitting to that specific test data, making your performance results misleadingly high. As a result, the model might perform worse in real-world scenarios because the test set is no longer an unbiased measure of generalization.
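A minimal, self-contained sketch of the split roles described above (plain Python, with a toy list standing in for real labeled examples):
```python
import random

random.seed(0)
data = list(range(1_000))                    # stand-in for labeled examples
random.shuffle(data)

n = len(data)
train = data[: int(0.8 * n)]                 # fit model parameters here
val = data[int(0.8 * n): int(0.9 * n)]       # tune hyperparameters / compare models here
test = data[int(0.9 * n):]                   # touch only once, for the final unbiased estimate

print(len(train), len(val), len(test))       # 800 100 100
```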

??? question "Explain bias-variance tradeoff. How does it manifest in LLMs?"

- <span class="def-mono-red">Bias</span>: Error from incorrect assumption in the model
- High bias leads to underfitting, where the model fails to capture patterns in the training data
- <span class="def-mono-red">Variance</span>: Error from sensitivity to small fluctuations in the training data
- High variance leads to overfitting, where the model memorizes noise instead of learning generalization patterns
The bias-variance tradeoff in ML describes the tension between the **ability to fit the training data** and the **ability to generalize to new data**.
<span class="def-mono-blue">bias-variance in LLM</span>:
- <span class="def-mono-gold">Model Parameters: Capacity vs. Overfitting</span>
- **Too few parameters**: A model with insufficient capacity (e.g., a small transformer) cannot capture complex patterns in the data, leading to high bias.
> A small LLM might fail to understand language or generate coherent long texts.
- **Too many parameters**: A model with excessive capacity risks overfitting to training data, memorizing noise and details instead of learning generalizable patterns
> A large LLM fine-tuned on a small dataset may generate text that is statistically similar to the training data but lacks coherence and factual accuracy (e.g., <span class="def-mono-red">hallucinations</span>).
- **Balancing Act**:
> More parameters reduce bias by enabling the model to capture complex patterns but increase variance if not regularized.
> Regularization techniques: (e.g dropout, weight decay) help mitigate overfitting in high-parameter models
- <span class="def-mono-gold">Training Epochs: Learning Duration vs. Overfitting</span>
- **Too few epochs**: The model hasn't learned enough from the data, leading to high bias.
> A transformer trained for only 1 epoch may fail to capture meaningful relationships in the text.
- **Too many epochs**: The model starts memorizing training data, increasing variance. This is common in transformers with high capacity and small datasets.
> A transformer fine-tuned on a medical dataset for 100 epochs may overfit to rare cases, leading to poor generalization.
- **Tradeoff in Transformers**
> Training loss decreases with epochs (low bias), but validation loss eventually increases (high variance).
>
> Early stopping is critical for transformers to avoid overfitting, especially when training on small or noisy datasets (see the sketch after this list).
- <span class="def-mono-gold">Noise vs Representativeness </span>
- **Low-quality data**: Noisy, biased, or incomplete data prevents the model from learning accurate patterns, increasing bias.
> A transformer trained on a dataset with limited examples of rare diseases may fail to diagnose them accurately
- **Noisy/unrepresentative data**: The model learns inconsistent patterns, increasing variance.
> A dataset with duplicate or corrupted text may cause the model to overfit. A transformer trained on a dataset with biased political content
> may generate polarized outputs.
> Data augmentation (e.g. paraphrasing, back-translation) increases diversity, mitigating overfitting
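A minimal early-stopping sketch tying the epoch tradeoff above to code. The training and validation losses here are simulated stand-ins (assumptions, not a real model) so the snippet runs on its own; with a real transformer you would replace them with actual train/validation passes.
```python
def train_one_epoch(epoch: int) -> float:
    # Stand-in: training loss keeps decreasing (bias keeps shrinking).
    return 2.0 * (0.9 ** epoch)

def validate(epoch: int) -> float:
    # Stand-in: validation loss improves, then degrades as overfitting sets in (variance grows).
    return 0.92 ** epoch + 0.01 * epoch

patience, best_val, bad_epochs = 3, float("inf"), 0
for epoch in range(100):
    train_loss = train_one_epoch(epoch)
    val_loss = validate(epoch)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0    # would also checkpoint the weights here
    else:
        bad_epochs += 1
    if bad_epochs >= patience:
        print(f"early stop at epoch {epoch}, best val loss {best_val:.4f}")
        break
```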

??? question "What is the difference between L1 and L2 regularization? When would you use elastic net in an LLM fine-tune?"

<span class="def-mono-gold">Regularization</span> adds a penalty term to the loss function so that the optimizer favours simpler or smoother solutions.
In practice it is usually added to a model‑level loss (cross‑entropy, MSE, …) as a separate scalar that scales with the weights.
\[ \text{Loss}_{\text{regularized}} = \text{Loss}_{\text{original}} + \lambda \cdot \text{Penalty}(w) \]
|Feature | L1 (Lasso) | L2 (Ridge)|
|-:|:-:|:-:|
|Weight Behavior| Many → 0 (sparse)|"All → small, non-zero"|
|Feature Selection| Yes| No|
|Solution|Not always unique|Always unique|
|Robust to Outliers|Less|More|
<span class="def-mono-red">Key Insight:</span>
- <span class="def-mono-blue">L1</span> as a *loss* (MAE) is more robust to outliers in the **targets**; as a *regularizer*, its main effect is sparsity and built-in feature selection.
- <span class="def-mono-blue">L2</span> regularization handles **correlated (collinear) features** better, spreading weight across them instead of arbitrarily keeping one.
- <span class="def-mono-blue">Elastic net</span> combines both penalties ($\lambda_1\lVert w\rVert_1 + \lambda_2\lVert w\rVert_2^2$): reach for it in an LLM fine-tune when you want L1-style sparsity (e.g., prunable adapter weights) but the parameters are highly correlated, where pure L1 is unstable about which weight it keeps.
<span class="def-mono-red">L1/L2 in LLM</span>:
- Use L2 by default. Use L1 if you want sparse, interpretable updates.
- L2 keeps updates smooth. L1 keeps updates minimal — and that’s often better for deployment.
- Use L2 to win benchmarks. Use L1 to ship to users.
> 1. Sparse LoRA = Tiny Adapters
> 2. Faster Inference (Real Speedup!)
> 3. Better Generalization (Less Overfitting)
> 4. Interpretable Fine-Tuning
> 5. Clean Model Merging
```text
Fine-tuning an LLM?
└─ Large, clean data?
   ├─ YES ──► Use L2 (weight_decay=0.01)
   └─ NO
      ├─ Need max accuracy?      ── YES ──► Use L2
      └─ Want small/fast model?  ── YES ──► Use L1 (+ pruning)
```
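A minimal PyTorch sketch of the two options above, using a tiny linear layer as a stand-in for an LLM or adapter (names and numbers are illustrative assumptions): L2 via the optimizer's decoupled `weight_decay`, L1 as an explicit penalty added to the loss.
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)                          # stand-in for an LLM / LoRA adapter
x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))

# L2: handled by the optimizer as decoupled weight decay (AdamW).
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

l1_lambda = 1e-4                                  # strength of the optional L1 penalty
for step in range(10):
    loss = nn.functional.cross_entropy(model(x), y)

    # L1: explicit penalty that pushes weights toward exact zeros (sparse, prunable updates).
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    loss = loss + l1_lambda * l1_penalty

    opt.zero_grad()
    loss.backward()
    opt.step()
```
Elastic-net behaviour falls out of using both at once (a non-zero `weight_decay` plus the L1 term).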

??? question "Prove that dropout is equivalent to an ensemble during inference (hint: geometric mean)."
- Where dropout appears in a Transformer: attention dropout, feedforward dropout, residual dropout
- The ensemble view of dropout in a Transformer:
- Each layer (and even each neuron) may be dropped independently.
- A particular dropout mask defines one specific subnetwork (one "member" of the ensemble).

**During training:** Randomly turn off some neurons (like flipping a coin for each one). This forces the network to learn many different "sub-networks" — each time you train, a different combination of neurons is active.
**During testing (inference):** Instead of picking one sub-network, we use all neurons, but scale down their strength (usually by half if dropout rate is 50%). This is the "mean network."
<span class="def-mono-gold">Why this is like an ensemble:</span>
Imagine you could run the model 1,000 times (or $2^N$ times for $N$ neurons), each time with a different random set of neurons turned off, and then average all their predictions. **That would be a huge ensemble of sub-networks** — very accurate, but way too slow.
Dropout's trick: using the scaled "mean network" at test time approximates averaging all those possible sub-networks; for a single softmax layer it equals exactly the (renormalized) geometric mean of their predictions, and for deep networks it is a close approximation.
<span class="def-mono-blue">Dropout = training lots of sub-networks, inference = using their collective average — fast and smart.</span>
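A toy PyTorch sketch of the ensemble view: sample many random dropout masks and average the outputs, then compare with the single deterministic pass used at inference. For this small linear head the two agree in expectation; for softmax outputs the scaled pass corresponds to a (renormalized) geometric mean of the sub-network predictions.
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 3))
x = torch.randn(1, 8)

# "Ensemble" view: many random sub-networks (dropout masks), predictions averaged.
net.train()                                   # keep dropout active
with torch.no_grad():
    ensemble_avg = torch.stack([net(x) for _ in range(10_000)]).mean(dim=0)

# Inference view: one pass through the "mean network". PyTorch uses inverted dropout
# (activations are scaled by 1/(1-p) during training), so eval mode needs no rescaling.
net.eval()
with torch.no_grad():
    mean_net = net(x)

print(ensemble_avg)
print(mean_net)    # close to the ensemble average
```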

??? question "What is the curse of dimensionality? How do positional encodings mitigate it in Transformers?"
Higher dimensions → sparser data → harder to learn meaningful relationships.

The curse of dimensionality refers to the set of problems that arise when data or model representations exist in high-dimensional spaces.
- **Data sparsity:** Points become exponentially sparse — distances between points tend to concentrate, making similarity less meaningful.
- **Combinatorial explosion:** The volume of the space grows exponentially, $O(k^d)$, so covering it requires exponentially more data.
- **Poor generalization:** Models struggle to learn smooth mappings because there’s too little data to constrain the high-dimensional space.
**Tokens in Transformers**
Transformers process tokens as vectors in a **high-dimensional** embedding space (e.g., 768 or 4096 dimensions).
However — *self-attention* treats each token as a set element rather than a sequence element. The attention mechanism itself has no built-in sense of order.
*The model only knows “content similarity,” not which token came first or last.*
Without order, the model would need to **learn positional relationships implicitly** across high-dimensional embeddings.
That’s hard — and it exacerbates the curse of dimensionality because:
- There’s no geometric bias for position.
- Each token embedding can drift freely in a massive space.
- The model must infer ordering purely from statistical co-occurrence — *requiring more data and more parameters.*
**How Positional Encodings Help**
**Positional encodings (PEs)** inject structured, low-dimensional information about sequence order directly into the embeddings.
- Adds a geometric bias to embeddings — nearby positions have nearby encodings.
- Reduces the effective search space — positions are no longer independent random vectors.
- Enables extrapolation: the sinusoidal pattern generalizes beyond training positions.
- The model can compute relative positions via linear operations (e.g., dot products of PEs reflect distance).
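A small NumPy sketch of the classic sinusoidal encodings, showing the geometric bias they add: encodings of nearby positions are more similar than those of distant positions.
```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sin/cos positional encodings from the original Transformer paper."""
    positions = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                          # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                       # even dims: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                       # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=64)
# Dot products of encodings decay with distance: position 10 is "closer" to 11 than to 100.
print(pe[10] @ pe[11], pe[10] @ pe[100])
```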

??? question "Explain maximum likelihood estimation for language modeling."
Training a neural LM (like a Transformer) by minimizing the negative log-likelihood (NLL) is the same as maximizing the likelihood:

$$\boxed{
\text{Maximizing likelihood}
\;\; \Leftrightarrow \;\;
\text{Maximizing log-likelihood}
\;\; \Leftrightarrow \;\;
\text{Minimizing negative log-likelihood}
}$$
**Example**
> Sentence: "The cat sat on the mat."
>
> The MLE objective trains the model to maximize:
> $P(\text{The}) \cdot P(\text{cat}|\text{The}) \cdot P(\text{sat}|\text{The cat}) \cdot P(\text{on}|\text{The cat sat}) \cdot \dots$

??? question "What is negative log-likelihood? Write the per-token loss for GPT."

$$\boxed{
\ell(\theta) = \sum_{t=1}^T \log P(x_t \mid x_{<t}; \theta)
}
$$
$$\boxed{
\text{NLL}(\theta) = -\ell(\theta)
}
$$
$$\boxed{
\text{NLL}(\theta) = -\sum_{t=1}^T \log P(x_t \mid x_{<t}; \theta)
}
$$
**where** $x_{<t}$ denotes **all tokens** before $t$: $x_1, \dots, x_{t-1}$
<!-- $$\boxed{
x_{<t} = \text{the past context used to predict } x_t
}$$ -->
This is the heart of autoregressive language modeling — like GPT!
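A minimal PyTorch sketch of the per-token loss above, with random logits standing in for a real GPT forward pass. It also makes the MLE factorization from the previous question concrete: summing per-token NLL is minus the log of the product of conditional probabilities.
```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50, 6
logits = torch.randn(1, seq_len, vocab_size)          # stand-in for model(x) outputs
tokens = torch.randint(0, vocab_size, (1, seq_len))   # the observed sequence x_1..x_T

# Shift: the logits at position t predict token t+1, so drop the last logit
# and the first token before computing the loss.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = tokens[:, 1:].reshape(-1)

# Per-token NLL = cross-entropy between the predicted distribution and the true next token.
nll_per_token = F.cross_entropy(shift_logits, shift_labels, reduction="none")
mean_nll = nll_per_token.mean()                        # what training minimizes
print(nll_per_token, mean_nll)
```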

??? question "Compare cross-entropy, perplexity, and BLEU. When is perplexity misleading?"

1. **Cross-Entropy:** Cross-entropy measures how well a probabilistic model predicts a target distribution — in LM, how well the model assigns high probability to the correct next tokens.
2. **Perplexity:** Perplexity (PPL) is simply the exponentiation of the cross-entropy: $\text{PPL} = \exp(\text{cross-entropy})$.
3. **BLEU (Bilingual Evaluation Understudy):** BLEU is an n-gram overlap metric for evaluating machine translation or text generation quality against reference texts
<span class="def-mono-blue">Perplexity rephrases cross-entropy in a more intuitive, human-readable form: roughly, the effective number of next-token choices the model is weighing at each step.</span>
> **Perplexity = "How predictable is the language?"**
>
> **BLEU = "How much does the output match a reference?"**
>
> Example:
>
> Reference: "The cat is on the mat."
>
> Model output: "The dog is on the mat."
>
> → Low perplexity (grammatical, fluent)
>
> → Low BLEU (wrong content)
> **BLEU is non-probabilistic and reference-based — unlike cross-entropy and perplexity.**
⚠️ <span class="def-mono-red">When Is Perplexity Misleading?</span>
**Perplexity only measures how well the model predicts tokens probabilistically — not how meaningful or correct the generated text is.**
- Different tokenizations or vocabularies
- A model with smaller tokens or subwords might have lower perplexity just because predictions are more granular, not actually better linguistically.
- Domain mismatch
- A model trained on Wikipedia might have low perplexity on Wikipedia text but produce incoherent answers to questions — it knows probabilities, not task structure.
- Human-aligned vs statistical objectives
- A model can assign high likelihood to grammatical but dull continuations (e.g., “The cat sat on the mat”) while rejecting creative or rare but correct continuations — good perplexity, poor real-world usefulness.
- Non-autoregressive or non-likelihood models
- For encoder-decoder or retrieval-augmented systems, perplexity may not correlate with generation quality because these models are not optimized purely for next-token prediction.
- Overfitting
- A model with very low perplexity on training data may memorize text, but generalize poorly (BLEU or human eval drops).
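A short sketch of the relationship: perplexity is just the exponential of the mean per-token cross-entropy (in nats here, since PyTorch's cross-entropy uses the natural log). BLEU, by contrast, needs reference texts and an n-gram matcher rather than model probabilities.
```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 50
logits = torch.randn(1, 6, vocab_size)                # stand-in for model outputs
labels = torch.randint(0, vocab_size, (1, 6))         # the true tokens

ce = F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1))
ppl = torch.exp(ce)                                   # perplexity = exp(cross-entropy)
print(f"cross-entropy: {ce:.3f} nats/token, perplexity: {ppl:.1f}")
```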

??? question "Why is label smoothing used in LLMs? Derive its modified loss."
Label smoothing is used in LLMs to prevent overconfidence and improve generalization.

Instead of training on a **one-hot** target (where the correct token has probability 1 and all others 0), a small portion ε of that probability is spread across all other tokens.
So the true token gets (1 − ε) probability, and the rest share ε uniformly.
This changes the loss from the usual “−log p(correct token)” to a mix of:
- `(1 − ε) × loss` for the correct token, and
- `ε × average loss` over all tokens.
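Written out (using the common convention, also used by PyTorch, in which the smoothed mass $\varepsilon$ is spread over all $V$ tokens including the correct one), the target distribution and modified loss are:
$$q_k = (1-\varepsilon)\,\mathbb{1}[k = y] + \frac{\varepsilon}{V}, \qquad
\mathcal{L}_{\text{LS}} = -\sum_{k=1}^{V} q_k \log p_\theta(k)
= (1-\varepsilon)\bigl[-\log p_\theta(y)\bigr] + \frac{\varepsilon}{V}\sum_{k=1}^{V}\bigl[-\log p_\theta(k)\bigr]$$
A small PyTorch sketch (toy logits, illustrative only) checking the manual formula against the built-in `label_smoothing` argument of `cross_entropy`:
```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
eps, vocab = 0.1, 50
logits = torch.randn(4, vocab)
targets = torch.randint(0, vocab, (4,))

# Manual: (1 - eps) * NLL(correct token) + eps * mean NLL over the whole vocabulary.
log_probs = F.log_softmax(logits, dim=-1)
nll_correct = -log_probs.gather(1, targets[:, None]).squeeze(1)
nll_uniform = -log_probs.mean(dim=-1)
manual = ((1 - eps) * nll_correct + eps * nll_uniform).mean()

# Built-in (PyTorch >= 1.10).
builtin = F.cross_entropy(logits, targets, label_smoothing=eps)
print(manual.item(), builtin.item())    # the two values should match
```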

??? question "What is the difference between hard and soft attention?"
- Hard attention → discrete, selective, non-differentiable.
- Soft attention → continuous, weighted, differentiable.
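A toy sketch of the contrast (random vectors, illustrative only): soft attention returns a differentiable weighted average over all values, while hard attention picks a single value, so training it typically needs REINFORCE-style estimators or a Gumbel-softmax relaxation.
```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
query = torch.randn(1, 8)
keys = values = torch.randn(5, 8)
scores = query @ keys.T / 8 ** 0.5            # (1, 5) similarity scores

# Soft attention: softmax weights, fully differentiable weighted sum.
weights = F.softmax(scores, dim=-1)
soft_out = weights @ values

# Hard attention: sample ONE index; the discrete choice blocks gradients.
idx = torch.multinomial(weights.squeeze(0), num_samples=1)
hard_out = values[idx]
print(soft_out.shape, hard_out.shape)
```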

Fundamentals of Large Language Models (LLMs)


!!! Abstract “Question Bank”

- <span class="def-mono-red">Fundamentals of Large Language Models (LLMs)</span>
??? tip "LLM Basics"
??? question "What are the main open-source LLM families currently available?"
- Llama: Decoder-Only
- Mistral: Decoder-Only (MoE in Mixtral)
- Gemma: Decoder-Only
- Phi: Decoder-Only
- Qwen: Decoder-Only (dense + MoE)
- DeepSeek: Decoder-Only (MoE in V2)
- Falcon: Decoder-Only
- OLMo: Decoder-Only
??? question "What’s the difference between prefix decoder, causal decoder, and encoder-decoder architectures?"
- **Causal Decoder (Decoder-Only)**: Autoregressive model that generates text left-to-right, attending only to previous tokens.
- **Prefix Decoder (PrefixLM)**: Causal decoder with a bidirectional prefix (input context) followed by autoregressive generation.
- **Encoder-Decoder (Seq2Seq)**: Two separate Transformer stacks (Encoder & Decoder): the encoder reads the input bidirectionally, and the decoder generates the output autoregressively while cross-attending to the encoder outputs.
??? Example "Causal Decoder"
- Prompt
> Translate to French: The cat is on the mat.
- Generation (autoregressive, causal mask):
> Le [only sees "Le"]
>
> Le chat [sees "Le chat"]
>
> Le chat est [sees "Le chat est"]
>
> Le chat est sur [sees up to "sur"]
>
> Le chat est sur le [sees up to "le"]
>
> Le chat est sur le tapis. [final]
- Summary
> **Cannot see future tokens**
>
> **Cannot see full input bidirectionally — but works via prompt engineering**
??? Example "Prefix Decoder"
- Input Format
> [Prefix] The cat is on the mat. [SEP] Translate to French: [Generate] Le chat est sur le tapis.
- Attention
> **Prefix** (The cat is on the mat. [SEP] Translate to French:) → bidirectional
>
> **Generation** (Le chat est sur le tapis.) → causal, token-by-token
??? Example "Encoder-Decoder"
- Input (encoder, fully bidirectional)
> Translate to French: The cat is on the mat.
- Output (decoder, autoregressive, cross-attending to the encoder outputs)
> Le chat est sur le tapis.
??? question "What is the training objective of large language models?"
LLMs are trained to predict the next token in a sequence.
??? question "Why are most modern LLMs decoder-only architectures?"
Most modern LLMs are decoder-only because this architecture is the simplest, fastest, and most flexible for large-scale text generation.
Below is the full reasoning, broken into the fundamental, engineering, and use-case levels.
- Decoder-only naturally matches the training objective
- Simpler architecture → easier scaling
- Better for long-context generation
- Fits universal multitask learning with a single text stream
- Aligns with inference needs
- streaming output
- token-by-token generation
- low latency
- high throughput
- continuous prompts
??? question "Explain the difference between encoder-only, decoder-only, and encoder-decoder models."
- <span class="def-mono-blue">Encoder-only Models (BERT, RoBERTa, DeBERTa, ELECTRA)</span>
- classification (sentiment, fraud detection)
- named entity recognition
- sentence similarity
- search / embeddings
- anomaly or pattern detection
- <span class="def-mono-blue">Decoder-only Models (GPT, Llama, Mixtral, Gemma, Qwen)</span>
- Text generation
- Multi-task language modeling
- Anything that treats tasks as text → text in one stream
- <span class="def-mono-blue">Encoder–Decoder (Seq2Seq) Models (T5, FLAN-T5, BART, mT5, early Transformer models)</span>
- Translation
- Summarization
- Text-to-text tasks with clear input → output mapping
??? question "What’s the difference between prefix LM and causal LM?"
- <span class="def-mono-red">Causal LM</span>: every token can only attend to previous tokens.
- <span class="def-mono-red">Prefix LM</span>: the prefix can be fully bidirectional, while the rest is generated causally.
|Feature|Causal LM|Prefix LM|
|---|---|---|
|Attention|Strictly left-to-right|Prefix: full; Generation: causal|
|Use case|Free-form generation|Conditional generation, prefix tuning|
|Examples|GPT, Llama, Mixtral|T5 (prefix mode), UL2, some prompt-tuning models|
|Future access?|No|Only inside prefix|
|Mask complexity|Simple|Mixed masks|
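A small PyTorch sketch of the two masking schemes in the table above (`True` means attention is allowed):
```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Strictly left-to-right: token t may attend to tokens 1..t only.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def prefix_lm_mask(prefix_len: int, seq_len: int) -> torch.Tensor:
    # Prefix tokens attend to each other bidirectionally;
    # generated tokens attend causally to everything before them.
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

print(causal_mask(5).int())
print(prefix_lm_mask(prefix_len=3, seq_len=5).int())
```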
??? tip "Layer Normalization Variants"
??? question "Comparison of LayerNorm vs BatchNorm vs RMSNorm?"
|Norm|Formula|Pros|Cons|
|---|---|---|---|
|BatchNorm|Normalize across batch|Great for CNNs|Bad for variable batch / autoregressive decoding|
|LayerNorm|Normalize across hidden dim|Stable for Transformers|Slightly more compute than RMSNorm|
|RMSNorm|Normalize only scale|Faster, more stable in LLMs|No centering → sometimes slightly less expressive|
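A minimal RMSNorm implementation (a simplified sketch, not copied from any particular codebase) next to `nn.LayerNorm`, showing the one difference: RMSNorm only rescales by the root-mean-square and never subtracts the mean.
```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))   # learned scale, no bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS over the hidden dimension; no mean subtraction (no centering).
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

x = torch.randn(2, 4, 8)
print(RMSNorm(8)(x).shape)       # RMSNorm: scale-only normalization
print(nn.LayerNorm(8)(x).shape)  # LayerNorm: subtracts the mean, then rescales
```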
??? question "What’s the core idea of DeepNorm?"
**DeepNorm keeps the Transformer stable at extreme depths by up-scaling the residual connection with a depth-dependent constant (a fractional power of the number of layers) and correspondingly down-scaling the initialization of the sublayer weights.**
??? question "What are the advantages of DeepNorm?"
**DeepNorm = deep models that actually train and perform well, without tricks.**
- Enables Extremely Deep Transformers (1,000+ layers)
- Superior Training Stability
- Improved Optimization Landscape
- Better Performance on Downstream Tasks
- No Architectural Overhead
- Robust Across Scales and Tasks
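A hedged sketch of the DeepNorm-style residual block described above: the residual branch is multiplied by a depth-dependent constant `alpha` before the LayerNorm. The `(2 * num_layers) ** 0.25` value is the commonly cited decoder-only setting from the DeepNet paper; treat the exact constant (and the matching initialization scaling, omitted here) as an assumption to verify against the paper.
```python
import torch
import torch.nn as nn

class DeepNormBlock(nn.Module):
    """x_out = LayerNorm(alpha * x + Sublayer(x)), with alpha growing with depth."""
    def __init__(self, d_model: int, alpha: float):
        super().__init__()
        self.alpha = alpha
        self.sublayer = nn.Linear(d_model, d_model)   # stand-in for attention / FFN
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.alpha * x + self.sublayer(x))

num_layers = 64
alpha = (2 * num_layers) ** 0.25     # assumed decoder-only setting; see the DeepNet paper
x = torch.randn(2, 10, 512)
print(DeepNormBlock(512, alpha)(x).shape)
```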
??? question "What are the differences when applying LayerNorm at different positions in LLMs?"
- <span class="def-mono-red">Post-Norm (Original Transformer, 2017)</span>: Normalizes after adding the residual.
- Pros:
- Fairly stable for shallow models (<12 layers)
- Works well in classic NMT models
- Cons:
- Fails to train deep models (vanishing/exploding gradients)
- Poor gradient flow
- Not used in modern LLMs
- Pre-Norm (Current Standard in GPT/LLaMA): Normalize before attention or feed-forward
- Pros:
- Much more stable for deep Transformers
- Great training stability up to hundreds of layers
- Works well with small batch sizes
- Default in GPT-2/3, LLaMA, Mistral, Gemma, Phi-3, Qwen2
- Cons:
- Residual stream grows in magnitude unless controlled (→ RMSNorm or DeepNorm often added)
- Slightly diminished expressive capacity compared to Post-Norm (but negligible in practice)
- Sandwich-Norm: LayerNorm applied before AND after sublayers.
- Pros:
- Extra stability & smoothness
- Improved optimization in some NMT models
- Cons:
- Expensive (two norms per sublayer)
- Rarely used in large decoder-only LLMs
🧠 Why LayerNorm position matters
1. Training Stability
- Pre-Norm prevents exploding residuals
- Post-Norm accumulates errors → unstable for deep models
2. Gradient Flow
- Residuals in Pre-Norm allow gradients to bypass the sublayers directly.
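A side-by-side PyTorch sketch of the two placements for a single attention sublayer (toy sizes, illustrative only):
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
norm = nn.LayerNorm(d_model)
x = torch.randn(2, 10, d_model)

# Post-Norm (original 2017 Transformer): normalize AFTER adding the residual.
post = norm(x + attn(x, x, x, need_weights=False)[0])

# Pre-Norm (GPT-2/3, LLaMA style): normalize BEFORE the sublayer; the residual
# stream itself is never normalized, so gradients can flow straight through it.
h = norm(x)
pre = x + attn(h, h, h, need_weights=False)[0]
print(post.shape, pre.shape)
```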
??? question "Which normalization method is used in different LLM architectures?"
**Large decoder-only LLMs almost universally use RMSNorm + Pre-Norm.**
??? tip "Activation Functions in LLMs"
??? question "What’s the formula for the FFN (Feed-Forward Network) block?"
- <span class="def-mono-red">Standard FFN Formula</span>
$$\text{FFN}(x) = W_2 \, \sigma(W_1 x + b_1) + b_2$$
$$W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$$
$$b_1 \in \mathbb{R}^{d_{\text{ff}}}$$
$$W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$$
$$b_2 \in \mathbb{R}^{d_{\text{model}}}$$
$$\sigma = \text{activation function: ReLU (original Transformer), GeLU (GPT), SwiGLU (modern LLMs)}$$
- <span class="def-mono-blue">Gated FFN in LLMs</span>
$$\text{FFN}(x) = W_3 \left( \text{Swish}(W_1x) \odot W_2x \right)$$
$$\text{Swish}(u) = u \cdot \sigma(u)$$
$$W_1, W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$$
$$W_3 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$$
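A minimal PyTorch version of the gated FFN above (a LLaMA-style SwiGLU sketch; layer names and sizes are illustrative):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """FFN(x) = W3( Swish(W1 x) ⊙ W2 x )"""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # gate projection
        self.w2 = nn.Linear(d_model, d_ff, bias=False)   # value ("up") projection
        self.w3 = nn.Linear(d_ff, d_model, bias=False)   # output ("down") projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.silu(u) = u * sigmoid(u), i.e. Swish with beta = 1.
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

x = torch.randn(2, 5, 512)
print(SwiGLUFFN(d_model=512, d_ff=1376)(x).shape)
```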
??? question "What’s the GeLU formula?"
**Gaussian Error Linear Unit (GeLU)**
$$\text{GeLU}(x) = \frac{x}{2}\left(1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$$
$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} \, dt$$
??? question "What’s the Swish formula?"
**Swish is a smooth, non-monotonic activation.**
$$\text{Swish}(x) = \frac{x}{1 + e^{-x}}$$
??? question "What’s the formula of an FFN block with GLU (Gated Linear Unit)?"
??? question "What’s the formula of a GLU block using GeLU?"
??? question "What’s the formula of a GLU block using Swish?"
??? question "Which activation functions do popular LLMs use?"
??? question "What are the differences between Adam and SGD optimizers?"
??? tip "Attention Mechanisms — Advanced Topics"
??? question "What are the problems with traditional attention?"
??? question "What are the directions of improvement for attention?"
??? question "What are the attention variants?"
??? question "What issues exist in multi-head attention?"
??? question "Explain Multi-Query Attention (MQA)."
??? question "Compare Multi-head, Multi-Query, and Grouped-Query Attention."
??? question "What are the benefits of MQA?"
??? question "Which models use MQA or GQA?"
??? question "Why was FlashAttention introduced? Briefly explain its core idea."
??? question "What are FlashAttention advantages?"
??? question "Which models implement FlashAttention?"
??? question "What is parallel transformer block?"
??? question "What’s the computational complexity of attention and how can it be improved?"
??? question "Compare MHA, GQA, and MQA — what are their key differences?"
??? tip "Cross-Attention"
??? question "Why do we need Cross-Attention?"
??? question "Explain Cross-Attention."
??? question "Compare Cross-Attention and Self-Attention — similarities and differences."
??? question "Provide a code implementation of Cross-Attention."
??? question "What are its application scenarios?"
??? question "What are the advantages and challenges of Cross-Attention?"
??? tip "Transformer Operations"
??? question "How to load a BERT model using transformers?"
??? question "How to output a specific hidden_state from BERT using transformers?"
??? question "How to get the final or intermediate layer vector outputs of BERT?"
??? tip "LLM Loss Functions"
??? question "What is KL divergence?"
??? question "Write the cross-entropy loss and explain its meaning."
??? question "What’s the difference between KL divergence and cross-entropy?"
??? question "How to handle large loss differences in multi-task learning?"
??? question "Why is cross-entropy preferred over MSE for classification tasks?"
??? question "What is information gain?"
??? question "How to compute softmax and cross-entropy loss (and binary cross-entropy)?"
??? question "What if the exponential term in softmax overflows the float limit?"
??? tip "Similarity & Contrastive Learning"
??? question "Besides cosine similarity, what other similarity metrics exist?"
??? question "What is contrastive learning?"
??? question "How important are negative samples in contrastive learning, and how to handle costly negative sampling?"
- <span class="def-mono-red">Advanced Topics in LLMs</span>
??? tip "Advanced LLM"
??? question "What is a generative large model?"
??? question "How do LLMs make generated text diverse and non-repetitive?"
??? question "What is the repetition problem (LLM echo problem)? Why does it happen? How can it be mitigated?"
??? question "Can LLaMA handle infinitely long inputs? Explain why?"
??? question "When should you use BERT vs. LLaMA / ChatGLM models?"
??? question "Do different domains require their own domain-specific LLMs? Why?"
??? question "How to enable an LLM to process longer texts?"
- <span class="def-mono-red">Fine-Tuning Large Models</span>
??? tip "General Fine-Tuning"
??? question "Why does the loss drop suddenly in the second epoch during SFT?"
??? question "How much VRAM is needed for full fine-tuning?"
??? question "Why do models seem dumber after SFT?"
??? question "How to construct instruction fine-tuning datasets?"
??? question "How to improve prompt representativeness?"
??? question "How to increase prompt data volume?"
??? question "How to select domain data for continued pretraining?"
??? question "How to prevent forgetting general abilities after domain tuning?"
??? question "How to make the model learn more knowledge during pretraining?"
??? question "When performing SFT, should the base model be Chat or Base?"
??? question "What’s the input/output format for domain fine-tuning?"
??? question "How to build a domain evaluation set?"
??? question "Is vocabulary expansion necessary? Why?"
??? question "How to train your own LLM?"
??? question "What are the benefits of instruction fine-tuning?"
??? question "During which stage — pretraining or fine-tuning — is knowledge injected?"
??? tip "SFT Tricks"
??? question "What’s the typical SFT workflow?"
??? question "What are key aspects of training data?"
??? question "How to choose between large and small models?"
??? question "How to ensure multi-task training balance?"
??? question "Can SFT learn knowledge at all?"
??? question "How to select datasets effectively?"
??? tip "Training Experience"
??? question "How to choose a distributed training framework?"
??? question "What are key LLM training tips?"
??? question "How to choose model size?"
??? question "How to select GPU accelerators?"
- <span class="def-mono-red">LangChain and Agent-Based Systems</span>
??? tip "LangChain Core"
??? question "What is LangChain?"
??? question "What are its core concepts?"
??? question "Components and Chains"
??? question "Prompt Templates and Values"
??? question "Example Selectors"
??? question "Output Parsers"
??? question "Indexes and Retrievers"
??? question "Chat Message History"
??? question "Agents and Toolkits"
??? tip "Long-Term Memory in Multi-Turn Conversations"
??? question "How can Agents access conversation context?"
??? question "Retrieve full history"
??? question "Use sliding window for recent context"
??? tip "Practical RAG Q&A using LangChain"
??? question "(Practical implementation questions about RAG apps in LangChain)"
- <span class="def-mono-red">Retrieval-Augmented Generation (RAG)</span>
??? tip "RAG Basics"
??? question "Why do LLMs need an external (vector) knowledge base?"
??? question "What’s the overall workflow of LLM+VectorDB document chat?"
??? question "What are the core technologies?"
??? question "How to build an effective prompt template?"
??? tip "RAG Concepts"
??? question "What are the limitations of base LLMs that RAG solves?"
??? question "What is RAG?"
??? question "How to obtain accurate semantic representations?"
??? question "How to align query/document semantic spaces?"
??? question "How to match retrieval model output with LLM preferences?"
??? question "How to improve results via post-retrieval processing?"
??? question "How to optimize generator adaptation to inputs?"
??? question "What are the benefits of using RAG?"
??? tip "RAG Layout Analysis"
??? question "Why is PDF parsing necessary?"
??? question "What are common methods and their differences?"
??? question "What problems exist in PDF parsing?"
??? question "Why is table recognition important?"
??? question "What are the main methods?"
??? question "Traditional methods"
??? question "pdfplumber extraction techniques"
??? question "Why do we need text chunking?"
??? question "What are common chunking strategies (regex, Spacy, LangChain, etc.)?"
??? tip "RAG Retrieval Strategies"
??? question "Why use LLMs to assist recall?"
??? question "HYDE approach: idea and issues"
??? question "FLARE approach: idea and recall strategies"
??? question "Why construct hard negative samples?"
??? question "Random sampling vs. Top-K hard negative sampling"
??? tip "RAG Evaluation"
??? question "Why evaluate RAG?"
??? question "What are the evaluation methods, metrics, and frameworks?"
??? tip "RAG Optimization"
??? question "What are the optimization strategies for retrieval and generation modules?"
??? question "How to enhance context using knowledge graphs (KGs)?"
??? question "What are the problems with vector-based context augmentation?"
??? question "How can KG-based methods improve it?"
??? question "What are the main pain points in RAG and their solutions?"
??? question "Content missing"
??? question "Top-ranked docs missed"
??? question "Context loss"
??? question "Failure to extract answers"
??? question "Explain RAG-Fusion: why it's needed, core technologies, workflow, and advantages"
??? tip "Graph RAG"
??? question "Why do we need Graph RAG?"
??? question "What is Graph RAG and how does it work? Show a code example and use case."
??? question "How to improve ranking optimization in Graph RAG?"
- <span class="def-mono-red">Parameter-Efficient Fine-Tuning (PEFT)</span>
??? tip "PEFT Fundamentals"
??? question "What is fine-tuning, and how is it performed?"
??? question "Why do we need PEFT?"
??? question "What is PEFT and its advantages?"
??? tip "Adapter Tuning"
??? question "Why use adapter-tuning?"
??? question "What’s the core idea behind adapter-tuning?"
??? question "How does it differ from full fine-tuning?"