LLM Interview Questions
Questions
Machine Learning
Machine Learning Concepts
LLM Fundamentals
Question: Explain bias-variance tradeoff. How does it manifest in LLMs?
Answer
Tip
- Bias: error from incorrect assumptions in the model
  - High bias leads to underfitting, where the model fails to capture patterns in the training data
- Variance: error from sensitivity to small fluctuations in the training data
  - High variance leads to overfitting, where the model memorizes noise instead of learning generalizable patterns
The bias-variance tradeoff in ML describes the tension between the ability to fit the training data and the ability to generalize to new data.
Bias-variance in LLMs:
- Model Parameters: Capacity vs. Overfitting
  - Too few parameters: A model with insufficient capacity (e.g., a small transformer) cannot capture complex patterns in the data, leading to high bias.
    A small LLM might fail to understand language or generate coherent long texts.
  - Too many parameters: A model with excessive capacity risks overfitting to the training data, memorizing noise and details instead of learning generalizable patterns.
    A large LLM fine-tuned on a small dataset may generate text that is statistically similar to the training data but lacks coherence and factual accuracy (e.g., hallucinations).
  - Balancing act: More parameters reduce bias by enabling the model to capture complex patterns, but increase variance if not regularized. Regularization techniques (e.g., dropout, weight decay) help mitigate overfitting in high-parameter models.
- Training Epochs: Learning Duration vs. Overfitting
  - Too few epochs: The model hasn't learned enough from the data, leading to high bias.
    A transformer trained for only 1 epoch may fail to capture meaningful relationships in the text.
  - Too many epochs: The model starts memorizing the training data, increasing variance. This is common in transformers with high capacity and small datasets.
    A transformer fine-tuned on a medical dataset for 100 epochs may overfit to rare cases, leading to poor generalization.
  - Tradeoff in transformers: Training loss decreases with epochs (low bias), but validation loss eventually increases (high variance).
    Early stopping is critical for transformers to avoid overfitting, especially when training on small or noisy datasets.
- Data Quality: Noise vs. Representativeness
  - Low-quality data: Noisy, biased, or incomplete data prevents the model from learning accurate patterns, increasing bias.
    A transformer trained on a dataset with limited examples of rare diseases may fail to diagnose them accurately.
  - Noisy/unrepresentative data: The model learns inconsistent patterns, increasing variance.
    A dataset with duplicate or corrupted text may cause the model to overfit. A transformer trained on a dataset with biased political content may generate polarized outputs. Data augmentation (e.g., paraphrasing, back-translation) increases diversity, mitigating overfitting.
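The epoch tradeoff above is exactly what early stopping automates. A minimal sketch in plain Python (the loss values are made up for illustration): monitor validation loss and stop once it has not improved for a fixed number of epochs.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch with the best validation loss, stopping the scan
    once `patience` epochs pass without improvement."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # variance is taking over: validation loss keeps rising
    return best_epoch

# Validation loss falls while bias shrinks, then rises as variance grows.
losses = [2.1, 1.7, 1.4, 1.3, 1.35, 1.5, 1.8]
print(early_stop_epoch(losses))  # → 3
```

In a real training loop the same logic wraps the evaluation step at the end of each epoch, and the best checkpoint is restored before deployment.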
Question: What is the difference between L1 and L2 regularization? When would you use elastic net in an LLM fine-tune?
Answer
Tip
Regularization adds a penalty term to the loss function so that the optimizer favours simpler or smoother solutions. In practice it is usually added to a model‑level loss (cross‑entropy, MSE, …) as a separate scalar that scales with the weights.
$$\text{Loss}_{\text{regularized}} = \text{Loss}_{\text{original}} + \lambda \cdot \text{Penalty}(w)$$
| Feature | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Weight Behavior | Many → 0 (sparse) | All → small, non-zero |
| Feature Selection | Yes | No |
| Solution | Not always unique | Always unique |
| Robust to Outliers | Less | More |
Key Insight:
- L1 regularization is more robust to outliers in the data (target outliers)
- L2 regularization is more robust to outliers in the features (collinearity)
L1/L2 in LLM:
- Use L2 by default. Use L1 if you want sparse, interpretable updates.
- L2 keeps updates smooth. L1 keeps updates minimal — and that’s often better for deployment.
- Use L2 to win benchmarks. Use L1 to ship to users.
- Sparse LoRA = Tiny Adapters
- Faster Inference (Real Speedup!)
- Better Generalization (Less Overfitting)
- Interpretable Fine-Tuning
- Clean Model Merging
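The contrast between the two penalties shows up directly in their gradient-descent updates. A toy sketch (plain Python, illustrative numbers): the L2 step shrinks a weight multiplicatively and never reaches zero, while the L1 proximal step (soft-thresholding) snaps small weights to exactly zero, which is where the sparsity claimed above comes from.

```python
def l2_step(w, lr=0.1, lam=0.5):
    """One gradient step on the L2 penalty: w <- w - lr * lam * w."""
    return w - lr * lam * w

def l1_step(w, lr=0.1, lam=0.5):
    """One proximal step on the L1 penalty (soft-thresholding)."""
    t = lr * lam
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0  # small weights snap to exactly zero -> sparsity

w_l2 = w_l1 = 0.04
for _ in range(100):
    w_l2, w_l1 = l2_step(w_l2), l1_step(w_l1)
print(w_l2 > 0.0, w_l1 == 0.0)  # → True True
```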
┌──────────────────────┐
│ Fine-tuning an LLM? │
└────────┬─────────────┘
│
▼
┌──────────────────────┐ YES → Use L2 (weight_decay=0.01)
│ Large, clean data? │───NO──►┐
└────────┬─────────────┘ │
│ │
▼ ▼
┌──────────────────────┐ ┌──────────────────────────┐
│ Need max accuracy? │ │ Want small/fast model? │
└────────┬─────────────┘ └────────────┬─────────────┘
│ │
YES YES
│ │
▼ ▼
           Use L2                       Use L1 (+ pruning)
Question: Prove that dropout is equivalent to an ensemble during inference (hint: geometric distribution).
Answer
Tip
- Where dropout appears in a Transformer
- Attention dropout
- Feedforward dropout
- Residual dropout
- The ensemble view of dropout in a Transformer
- Each layer (and even each neuron) may be dropped independently.
- A particular dropout mask defines one specific subnetwork (one “member” of the ensemble).
During training: Randomly turn off some neurons (like flipping a coin for each one). This forces the network to learn many different “sub-networks” — each time you train, a different combination of neurons is active.
During testing (inference): Instead of picking one sub-network, we use all neurons, but scale down their strength (usually by half if dropout rate is 50%). This is the “mean network.”
Why this is like an ensemble:
Imagine you could run the model 1,000 times (or more), each time with a different random dropout mask, and average the predictions. That averaging is an ensemble of subnetworks. The weight-scaled “mean network” used at inference approximates that ensemble in a single forward pass; for softmax outputs, it approximates the geometric mean of the subnetworks’ predictions.
Dropout = training lots of sub-networks, inference = using their collective average — fast and smart.
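The claim above can be checked numerically. A toy sketch (a single linear “layer” with illustrative weights): averaging many randomly dropped subnetworks converges to the output of the weight-scaled mean network.

```python
import random

random.seed(0)

WEIGHTS = [0.2, -0.5, 0.8, 0.1]
P_DROP = 0.5

def subnetwork(x):
    """One ensemble member: each weight is kept with probability 1 - P_DROP."""
    return sum(w * x for w in WEIGHTS if random.random() >= P_DROP)

def mean_network(x):
    """Inference-time dropout: keep every weight, scale by 1 - P_DROP."""
    return (1 - P_DROP) * sum(w * x for w in WEIGHTS)

x = 1.0
ensemble_avg = sum(subnetwork(x) for _ in range(100_000)) / 100_000
print(abs(ensemble_avg - mean_network(x)) < 0.01)  # → True
```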
Question: What is the curse of dimensionality? How do positional encodings mitigate it in Transformers?
Answer
Tip
Higher dimensions → sparser data → harder to learn meaningful relationships.
The curse of dimensionality refers to the set of problems that arise when data or model representations exist in high-dimensional spaces.
- Data sparsity: Points become exponentially sparse — distances between points tend to concentrate, making similarity less meaningful.
- Combinatorial explosion: The volume of the space grows exponentially, so covering it requires exponentially more data.
- Poor generalization: Models struggle to learn smooth mappings because there’s too little data to constrain the high-dimensional space.
Tokens in Transformers
Transformers process tokens as vectors in a high-dimensional embedding space (e.g., 768 or 4096 dimensions). However, self-attention treats each token as a set element rather than a sequence element. The attention mechanism itself has no built-in sense of order: the model only knows “content similarity,” not which token came first or last.
Without order, the model would need to learn positional relationships implicitly across high-dimensional embeddings. That’s hard — and it exacerbates the curse of dimensionality because:
- There’s no geometric bias for position.
- Each token embedding can drift freely in a massive space.
- The model must infer ordering purely from statistical co-occurrence — requiring more data and more parameters.
How Positional Encodings Help
Positional encodings (PEs) inject structured, low-dimensional information about sequence order directly into the embeddings.
- Adds a geometric bias to embeddings — nearby positions have nearby encodings.
- Reduces the effective search space — positions are no longer independent random vectors.
- Enables extrapolation: the sinusoidal pattern generalizes beyond training positions.
- The model can compute relative positions via linear operations (e.g., dot products of PEs reflect distance).
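A minimal sketch of the sinusoidal encodings (standard-library Python; `seq_len` and `d_model` chosen for illustration), showing the geometric bias: encodings of nearby positions are more similar than those of distant ones.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: pairs of (sin, cos) at
    geometrically spaced frequencies, one row per position."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            row += [math.sin(angle), math.cos(angle)]
        pe.append(row)
    return pe

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

pe = positional_encoding(seq_len=50, d_model=16)
# Nearby positions get more similar encodings than distant ones.
print(dot(pe[10], pe[11]) > dot(pe[10], pe[40]))  # → True
```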
Question: Explain maximum likelihood estimation for language modeling.
Answer
Tip
Training a neural LM (like a Transformer) by minimizing the negative log-likelihood (NLL) is the same as maximizing the likelihood:
$$\hat{\theta} = \arg\max_\theta \prod_{t=1}^{T} P_\theta(w_t \mid w_{<t}) = \arg\min_\theta \left( -\sum_{t=1}^{T} \log P_\theta(w_t \mid w_{<t}) \right)$$
Example
Sentence: “The cat sat on the mat.”
The MLE objective trains the model to maximize:
$$P(\text{The}) \cdot P(\text{cat} \mid \text{The}) \cdot P(\text{sat} \mid \text{The cat}) \cdot P(\text{on} \mid \text{The cat sat}) \cdot P(\text{the} \mid \text{The cat sat on}) \cdot P(\text{mat} \mid \text{The cat sat on the})$$
Question: What is negative log-likelihood? Write the per-token loss for GPT.
Answer
Tip
$$\mathcal{L}_t = -\log P_\theta(x_t \mid x_{<t}), \qquad \mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$$
where $x_t$ is the token at position $t$ and $x_{<t}$ are all preceding tokens.
This is the heart of autoregressive language modeling — like GPT!
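A toy numeric check of the per-token loss (a made-up 3-token vocabulary with illustrative probabilities):

```python
import math

def per_token_nll(probs, targets):
    """GPT-style loss: -log p(correct next token), averaged over positions."""
    losses = [-math.log(p[t]) for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)

# Each row is the model's distribution P(next token | context).
probs = [
    [0.7, 0.2, 0.1],  # position 1: correct token is index 0
    [0.1, 0.8, 0.1],  # position 2: correct token is index 1
]
targets = [0, 1]
loss = per_token_nll(probs, targets)
print(round(loss, 4))
```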
Question: Compare cross-entropy, perplexity, and BLEU. When is perplexity misleading?
Answer
Tip
- Cross-Entropy: Cross-entropy measures how well a probabilistic model predicts a target distribution — in LM, how well the model assigns high probability to the correct next tokens.
- Perplexity: Perplexity (PPL) is simply the exponentiation of the cross-entropy
- BLEU (Bilingual Evaluation Understudy): BLEU is an n-gram overlap metric for evaluating machine translation or text generation quality against reference texts
Perplexity is cross-entropy rephrased in a more intuitive, human-readable form.
Perplexity = “How predictable is the language?”
BLEU = “How much does the output match a reference?”
Example:
Reference: “The cat is on the mat.”
Model output: “The dog is on the mat.”
→ Low perplexity (grammatical, fluent)
→ Low BLEU (wrong content)
BLEU is non-probabilistic and reference-based — unlike cross-entropy and perplexity.
⚠️ When Is Perplexity Misleading?
Perplexity only measures how well the model predicts tokens probabilistically — not how meaningful or correct the generated text is.
- Different tokenizations or vocabularies
- A model with smaller tokens or subwords might have lower perplexity just because predictions are more granular, not actually better linguistically.
- Domain mismatch
- A model trained on Wikipedia might have low perplexity on Wikipedia text but produce incoherent answers to questions — it knows probabilities, not task structure.
- Human-aligned vs statistical objectives
- A model can assign high likelihood to grammatical but dull continuations (e.g., “The cat sat on the mat”) while rejecting creative or rare but correct continuations — good perplexity, poor real-world usefulness.
- Non-autoregressive or non-likelihood models
- For encoder-decoder or retrieval-augmented systems, perplexity may not correlate with generation quality because these models are not optimized purely for next-token prediction.
- Overfitting
- A model with very low perplexity on training data may memorize text, but generalize poorly (BLEU or human eval drops).
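The cross-entropy/perplexity relationship in code (illustrative probabilities): a model that always assigns the correct token probability 1/k has perplexity k, i.e., it is as "perplexed" as picking uniformly among k candidates.

```python
import math

def perplexity(token_probs):
    """PPL = exp(average negative log-likelihood of the correct tokens)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Always assigning the correct token probability 0.25 is equivalent
# to a uniform guess among 4 options.
print(round(perplexity([0.25, 0.25, 0.25]), 6))  # → 4.0
```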
Question: Why is label smoothing used in LLMs? Derive its modified loss?
Answer
Tip
Label smoothing is used in LLMs to prevent overconfidence and improve generalization.
Instead of training on a one-hot target (where the correct token has probability 1 and all others 0), a small portion ε of that probability is spread across all other tokens.
So the true token gets (1 − ε) probability, and the rest share ε uniformly.
This changes the loss from the usual −log p(correct token) to a mix of losses:
$$\mathcal{L}_{\text{LS}} = (1-\varepsilon)\cdot(-\log p_{y}) \;+\; \varepsilon\cdot\frac{1}{V}\sum_{i=1}^{V}(-\log p_{i})$$
i.e., (1 − ε) × loss for the correct token, plus ε × average loss over all V tokens in the vocabulary.
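A minimal sketch of that mixed loss (plain Python, illustrative probabilities). Note how the smoothed loss exceeds plain cross-entropy here: spreading a little mass onto the other tokens penalizes overconfident spikes.

```python
import math

def smoothed_loss(probs, target, eps=0.1):
    """(1 - eps) * NLL of the target + eps * mean NLL over the vocabulary."""
    V = len(probs)
    nll_target = -math.log(probs[target])
    nll_uniform = -sum(math.log(p) for p in probs) / V
    return (1 - eps) * nll_target + eps * nll_uniform

probs = [0.7, 0.2, 0.1]
hard = smoothed_loss(probs, target=0, eps=0.0)  # plain cross-entropy
soft = smoothed_loss(probs, target=0, eps=0.1)
print(soft > hard)  # → True
```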
Question: What is the difference between hard and soft attention?
Answer
Tip
- Hard attention → discrete, selective, non-differentiable.
- Soft attention → continuous, weighted, differentiable.
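The distinction in code (toy 2-dimensional vectors): hard attention indexes one value with argmax, so no gradient flows through the choice, while soft attention returns a differentiable weighted average of all values.

```python
import math

def soft_attention(query, keys, values):
    """Soft attention: a differentiable, softmax-weighted average of values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    exps = [math.exp(s) for s in scores]
    weights = [e / sum(exps) for e in exps]
    return sum(w * v for w, v in zip(weights, values))

def hard_attention(query, keys, values):
    """Hard attention: select exactly one value (argmax, non-differentiable)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return values[scores.index(max(scores))]

q, keys, vals = [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [10.0, 20.0]
print(hard_attention(q, keys, vals), round(soft_attention(q, keys, vals), 2))
```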
Fundamentals of Large Language Models (LLMs)
Question Bank
- Fundamentals of Large Language Models (LLMs)
LLM Basic
Tip
Question: What are the main open-source LLM families currently available?
Answer
Tip
- Llama: Decoder-Only
- Mistral: Decoder-Only (MoE in Mixtral)
- Gemma: Decoder-Only
- Phi: Decoder-Only
- Qwen: Decoder-Only (dense + MoE)
- DeepSeek: Decoder-Only (MoE in V2)
- Falcon: Decoder-Only
- OLMo: Decoder-Only
Question: What’s the difference between prefix decoder, causal decoder, and encoder-decoder architectures?
Answer
Tip
- Causal Decoder (Decoder-Only): Autoregressive model that generates text left-to-right, attending only to previous tokens.
- Prefix Decoder (PrefixLM): Causal decoder with a bidirectional prefix (input context) followed by autoregressive generation.
- Encoder-Decoder (Seq2Seq): Two separate Transformer stacks (encoder & decoder); the encoder reads the input bidirectionally and the decoder generates autoregressively while cross-attending to the encoder output.
Causal Decoder
- Prompt
Translate to French: The cat is on the mat.
- Generation (autoregressive, causal mask):
Le [only sees “Le”]
Le chat [sees “Le chat”]
Le chat est [sees “Le chat est”]
Le chat est sur [sees up to “sur”]
Le chat est sur le [sees up to “le”]
Le chat est sur le tapis. [final]
- Summary
Cannot see future tokens
Cannot see full input bidirectionally — but works via prompt engineering
Prefix Decoder
- Input Format
[Prefix] The cat is on the mat. [SEP] Translate to French: [Generate] Le chat est sur le tapis.
- Attention
  Prefix (The cat is on the mat. [SEP] Translate to French:) → bidirectional
  Generation (Le chat est sur le tapis.) → causal
Encoder-Decoder
The encoder reads the full source sentence bidirectionally; the decoder generates the translation token by token, attending to the encoder output via cross-attention.
Question: What is the training objective of large language models?
Answer
Tip
LLMs are trained to predict the next token in a sequence.
Question: Why are most modern LLMs decoder-only architectures?
Answer
Tip
Most modern LLMs are decoder-only because this architecture is the simplest, fastest, and most flexible for large-scale text generation. Below is the full reasoning, broken into the fundamental, engineering, and use-case levels.
- Decoder-only naturally matches the training objective
- Simpler architecture → easier scaling
- Better for long-context generation
- Fits universal multitask learning with a single text stream
- Aligns with inference needs
- streaming output
- token-by-token generation
- low latency
- high throughput
- continuous prompts
Question: Explain the difference between encoder-only, decoder-only, and encoder-decoder models.
Answer
Tip
- Encoder-only Models (BERT, RoBERTa, DeBERTa, ELECTRA)
- classification (sentiment, fraud detection)
- named entity recognition
- sentence similarity
- search / embeddings
- anomaly or pattern detection
- Decoder-only Models (GPT, Llama, Mixtral, Gemma, Qwen)
- Text generation
- Multi-task language modeling
- Anything that treats tasks as text → text in one stream
- Encoder–Decoder (Seq2Seq) Models (T5, FLAN-T5, BART, mT5, early Transformer models)
- Translation
- Summarization
- Text-to-text tasks with clear input → output mapping
Question: What’s the difference between prefix LM and causal LM?
Answer
Tip
- Causal LM: every token can only attend to previous tokens.
- Prefix LM: the prefix can be fully bidirectional, while the rest is generated causally.
| Feature | Causal LM | Prefix LM |
|---|---|---|
| Attention | Strictly left-to-right | Prefix: full; Generation: causal |
| Use case | Free-form generation | Conditional generation, prefix tuning |
| Examples | GPT, Llama, Mixtral | T5 (prefix mode), UL2, some prompt-tuning models |
| Future access? | No | Only inside prefix |
| Mask complexity | Simple | Mixed masks |
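The “mixed masks” row can be made concrete. A sketch of the two attention masks as 0/1 matrices (rows are queries, columns are keys; sizes chosen for illustration):

```python
def causal_mask(n):
    """Causal LM: token i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def prefix_mask(n, prefix_len):
    """Prefix LM: the first `prefix_len` tokens attend bidirectionally;
    the remaining tokens keep the causal pattern."""
    mask = causal_mask(n)
    for i in range(prefix_len):
        for j in range(prefix_len):
            mask[i][j] = 1
    return mask

# 4 tokens, first 2 form the prefix: note the bidirectional 2x2 block.
for row in prefix_mask(4, prefix_len=2):
    print(row)
```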
Layer Normalization Variants
Tip
Question: Comparison of LayerNorm vs BatchNorm vs RMSNorm?
Answer
Tip
| Norm | Formula | Pros | Cons |
|---|---|---|---|
| BatchNorm | Normalize across batch | Great for CNNs | Bad for variable batch / autoregressive decoding |
| LayerNorm | Normalize across hidden dim | Stable for Transformers | Slightly more compute than RMSNorm |
| RMSNorm | Normalize only scale | Faster, more stable in LLMs | No centering → sometimes slightly less expressive |
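A minimal sketch of the two Transformer-side norms from the table (plain Python over one hidden vector; the learnable gain/bias terms are omitted for brevity): LayerNorm centers and scales, while RMSNorm only scales, saving the mean subtraction.

```python
import math

def layer_norm(x, eps=1e-6):
    """Center and scale across the hidden dimension."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """Scale only, no centering: cheaper, used in LLaMA-style models."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [2.0, 4.0, 6.0, 8.0]
print([round(v, 2) for v in layer_norm(x)])  # zero-mean, unit-variance
print([round(v, 2) for v in rms_norm(x)])    # unit RMS, mean not removed
```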
Question: What’s the core idea of DeepNorm?
Answer
Tip
DeepNorm keeps the Transformer stable at extreme depths by up-scaling the residual branch with a constant α that grows with the number of layers N (e.g., α = (2N)^{1/4} for decoder-only models), paired with correspondingly down-scaled initialization.
Question: What are the advantages of DeepNorm?
Answer
Tip
DeepNorm = deep models that actually train and perform well, without tricks.
- Enables Extremely Deep Transformers (1,000+ layers)
- Superior Training Stability
- Improved Optimization Landscape
- Better Performance on Downstream Tasks
- No Architectural Overhead
- Robust Across Scales and Tasks
Question: What are the differences when applying LayerNorm at different positions in LLMs?
Answer
Tip
- Post-Norm (original Transformer, 2017): normalizes after adding the residual.
  - Pros:
    - Fairly stable for shallow models (<12 layers)
    - Works well in classic NMT models
  - Cons:
    - Fails to train deep models (vanishing/exploding gradients)
    - Poor gradient flow
    - Not used in modern LLMs
- Pre-Norm (current standard in GPT/LLaMA): normalizes before attention or feed-forward.
  - Pros:
    - Much more stable for deep Transformers
    - Great training stability up to hundreds of layers
    - Works well with small batch sizes
    - Default in GPT-2/3, LLaMA, Mistral, Gemma, Phi-3, Qwen2
  - Cons:
    - Residual stream grows in magnitude unless controlled (→ RMSNorm or DeepNorm often added)
    - Slightly diminished expressive capacity compared to Post-Norm (but negligible in practice)
- Sandwich-Norm: LayerNorm applied before AND after sublayers.
  - Pros:
    - Extra stability & smoothness
    - Improved optimization in some NMT models
  - Cons:
    - Expensive (two norms per sublayer)
    - Rarely used in large decoder-only LLMs
🧠 Why LayerNorm position matters
1. Training Stability
   - Pre-Norm prevents exploding residuals
   - Post-Norm accumulates errors → unstable for deep models
2. Gradient Flow
   - Residuals in Pre-Norm allow gradients to bypass the sublayers directly.
Question: Which normalization method is used in different LLM architectures?
Answer
Tip
Large decoder-only LLMs almost universally use RMSNorm + Pre-Norm.
Activation Functions in LLMs
Tip
Question: What’s the formula for the FFN (Feed-Forward Network) block?
Answer
Tip
- Standard FFN Formula
  $$\text{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2, \qquad W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}},\; W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$$
- Gated FFN in LLMs
  $$\text{FFN}_{\text{GLU}}(x) = W_2\,\big(\sigma(W_1 x) \otimes (V x)\big)$$
Question: What’s the GeLU formula?
Answer
Tip
Gaussian Error Linear Unit (GeLU):
$$\text{GeLU}(x) = x \cdot \Phi(x) \approx 0.5x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,(x + 0.044715\,x^3)\right)\right)$$
Question: What’s the Swish formula?
Answer
Tip
Swish is a smooth, non-monotonic activation:
$$\text{Swish}(x) = x \cdot \sigma(\beta x) = \frac{x}{1 + e^{-\beta x}}$$
Question: What’s the formula of an FFN block with GLU (Gated Linear Unit)?
Answer
Tip
Question: What’s the formula of a GLU block using GeLU?
Answer
Tip
Question: What’s the formula of a GLU block using Swish?
Answer
Tip
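The activation questions above can be answered in one sketch (plain Python; the SwiGLU variant is written with scalar weights for brevity, real FFNs use matrices W1, V, W2):

```python
import math

def gelu(x):
    """GeLU(x) = x * Phi(x); tanh approximation used in the GPT family."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

def swish(x, beta=1.0):
    """Swish(x) = x * sigmoid(beta * x); beta = 1 gives SiLU."""
    return x / (1 + math.exp(-beta * x))

def swiglu_ffn(x, W1, V, W2):
    """FFN_SwiGLU(x) = W2 * (Swish(W1 x) * (V x)), scalar sketch."""
    return W2 * (swish(W1 * x) * (V * x))

print(round(gelu(1.0), 4), round(swish(1.0), 4))
```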
Question: Which activation functions do popular LLMs use?
Answer
Tip
Question: What are the differences between Adam and SGD optimizers?
Answer
Tip
Attention Mechanisms — Advanced Topics
Tip
Question: What are the problems with traditional attention?
Answer
Tip
Question: What are the directions of improvement for attention?
Answer
Tip
Question: What are the attention variants?
Answer
Tip
Question: What issues exist in multi-head attention?
Answer
Tip
Question: Explain Multi-Query Attention (MQA).
Answer
Tip
Question: Compare Multi-head, Multi-Query, and Grouped-Query Attention.
Answer
Tip
Question: What are the benefits of MQA?
Answer
Tip
Question: Which models use MQA or GQA?
Answer
Tip
Question: Why was FlashAttention introduced? Briefly explain its core idea.
Answer
Tip
Question: What are FlashAttention advantages?
Answer
Tip
Question: Which models implement FlashAttention?
Answer
Tip
Question: What is parallel transformer block?
Answer
Tip
Question: What’s the computational complexity of attention and how can it be improved?
Answer
Tip
Question: Compare MHA, GQA, and MQA — what are their key differences?
Answer
Tip
Cross-Attention
Tip
Question: Why do we need Cross-Attention?
Answer
Tip
Question: Explain Cross-Attention.
Answer
Tip
Question: Compare Cross-Attention and Self-Attention — similarities and differences.
Answer
Tip
Question: Provide a code implementation of Cross-Attention.
Answer
Tip
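A minimal NumPy sketch (random projection weights, made-up shapes): the only structural difference from self-attention is that Q is projected from one sequence while K and V come from another.

```python
import numpy as np

def cross_attention(query_seq, kv_seq, d_k=4, seed=0):
    """Queries come from one sequence (e.g., decoder states); keys and
    values from another (e.g., encoder outputs)."""
    rng = np.random.default_rng(seed)
    Wq = rng.normal(size=(query_seq.shape[-1], d_k))
    Wk = rng.normal(size=(kv_seq.shape[-1], d_k))
    Wv = rng.normal(size=(kv_seq.shape[-1], d_k))
    Q, K, V = query_seq @ Wq, kv_seq @ Wk, kv_seq @ Wv
    scores = Q @ K.T / np.sqrt(d_k)               # (n_query, n_kv)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (n_query, d_k)

decoder_states = np.ones((3, 8))  # 3 target-side tokens
encoder_states = np.ones((5, 8))  # 5 source-side tokens
out = cross_attention(decoder_states, encoder_states)
print(out.shape)  # → (3, 4)
```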
Question: What are its application scenarios?
Answer
Tip
Question: What are the advantages and challenges of Cross-Attention?
Answer
Tip
Transformer Operations
Tip
Question: How to load a BERT model using transformers?
Answer
Tip
Question: How to output a specific hidden_state from BERT using transformers?
Answer
Tip
Question: How to get the final or intermediate layer vector outputs of BERT?
Answer
Tip
LLM Loss Functions
Tip
Question: What is KL divergence?
Answer
Tip
Question: Write the cross-entropy loss and explain its meaning.
Answer
Tip
Question: What’s the difference between KL divergence and cross-entropy?
Answer
Tip
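The relationship between the two can be verified numerically: cross-entropy decomposes as H(p, q) = H(p) + KL(p ‖ q), so minimizing cross-entropy against a fixed target p is equivalent to minimizing the KL divergence. A sketch with made-up distributions:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log q(x)"""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def entropy(p):
    """H(p) = -sum p(x) log p(x)"""
    return -sum(pi * math.log(pi) for pi in p)

def kl_divergence(p, q):
    """KL(p || q) = sum p(x) log(p(x) / q(x))"""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]  # "true" distribution
q = [0.4, 0.4, 0.2]  # model distribution
lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(abs(lhs - rhs) < 1e-12)  # → True
```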
Question: How to handle large loss differences in multi-task learning?
Answer
Tip
Question: Why is cross-entropy preferred over MSE for classification tasks?
Answer
Tip
Question: What is information gain?
Answer
Tip
Question: How to compute softmax and cross-entropy loss (and binary cross-entropy)?
Answer
Tip
Question: What if the exponential term in softmax overflows the float limit?
Answer
Tip
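The standard fix is to subtract the maximum logit before exponentiating: softmax is invariant to constant shifts, so the result is unchanged, but `exp()` never sees a large positive argument. A minimal sketch:

```python
import math

def naive_softmax(xs):
    """Raises OverflowError for large logits (e.g., math.exp(1000))."""
    exps = [math.exp(x) for x in xs]
    return [e / sum(exps) for e in exps]

def stable_softmax(xs):
    """Shift by the max first: identical result, no overflow."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    return [e / sum(exps) for e in exps]

logits = [1000.0, 1001.0, 1002.0]
# naive_softmax(logits) would raise OverflowError here.
print([round(p, 3) for p in stable_softmax(logits)])  # → [0.09, 0.245, 0.665]
```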
Similarity & Contrastive Learning
Tip
Question: Besides cosine similarity, what other similarity metrics exist?
Answer
Tip
Question: What is contrastive learning?
Answer
Tip
Question: How important are negative samples in contrastive learning, and how to handle costly negative sampling?
Answer
Tip
- Advanced Topics in LLMs
Advanced LLM
Tip
Question: What is a generative large model?
Answer
Tip
Question: How do LLMs make generated text diverse and non-repetitive?
Answer
Tip
Question: What is the repetition problem (LLM echo problem)? Why does it happen? How can it be mitigated?
Answer
Tip
Question: Can LLaMA handle infinitely long inputs? Explain why?
Answer
Tip
Question: When should you use BERT vs. LLaMA / ChatGLM models?
Answer
Tip
Question: Do different domains require their own domain-specific LLMs? Why?
Answer
Tip
Question: How to enable an LLM to process longer texts?
Answer
Tip
- Fine-Tuning Large Models
General Fine-Tuning
Tip
Question: Why does the loss drop suddenly in the second epoch during SFT?
Answer
Tip
Question: How much VRAM is needed for full fine-tuning?
Answer
Tip
Question: Why do models seem dumber after SFT?
Answer
Tip
Question: How to construct instruction fine-tuning datasets?
Answer
Tip
Question: How to improve prompt representativeness?
Answer
Tip
Question: How to increase prompt data volume?
Answer
Tip
Question: How to select domain data for continued pretraining?
Answer
Tip
Question: How to prevent forgetting general abilities after domain tuning?
Answer
Tip
Question: How to make the model learn more knowledge during pretraining?
Answer
Tip
Question: When performing SFT, should the base model be Chat or Base?
Answer
Tip
Question: What’s the input/output format for domain fine-tuning?
Answer
Tip
Question: How to build a domain evaluation set?
Answer
Tip
Question: Is vocabulary expansion necessary? Why?
Answer
Tip
Question: How to train your own LLM?
Answer
Tip
Question: What are the benefits of instruction fine-tuning?
Answer
Tip
Question: During which stage — pretraining or fine-tuning — is knowledge injected?
Answer
Tip
SFT Tricks
Tip
Question: What’s the typical SFT workflow?
Answer
Tip
Question: What are key aspects of training data?
Answer
Tip
Question: How to choose between large and small models?
Answer
Tip
Question: How to ensure multi-task training balance?
Answer
Tip
Question: Can SFT learn knowledge at all?
Answer
Tip
Question: How to select datasets effectively?
Answer
Tip
Training Experience
Tip
Question: How to choose a distributed training framework?
Answer
Tip
Question: What are key LLM training tips?
Answer
Tip
Question: How to choose model size?
Answer
Tip
Question: How to select GPU accelerators?
Answer
Tip
- LangChain and Agent-Based Systems
LangChain Core
Tip
Question: What is LangChain?
Answer
Tip
Question: What are its core concepts?
Answer
Tip
Question: Components and Chains
Answer
Tip
Question: Prompt Templates and Values
Answer
Tip
Question: Example Selectors
Answer
Tip
Question: Output Parsers
Answer
Tip
Question: Indexes and Retrievers
Answer
Tip
Question: Chat Message History
Answer
Tip
Question: Agents and Toolkits
Answer
Tip
Long-Term Memory in Multi-Turn Conversations
Tip
Question: How can Agents access conversation context?
Answer
Tip
Question: Retrieve full history
Answer
Tip
Question: Use sliding window for recent context
Answer
Tip
Question
Answer
Tip
Practical RAG Q&A using LangChain
Tip
Question: (Practical implementation questions about RAG apps in LangChain)
Answer
Tip
- Retrieval-Augmented Generation (RAG)
RAG Basics
Tip
Question: Why do LLMs need an external (vector) knowledge base?
Answer
Tip
Question: What’s the overall workflow of LLM+VectorDB document chat?
Answer
Tip
Question: What are the core technologies?
Answer
Tip
Question: How to build an effective prompt template?
Answer
Tip
RAG Concepts
Tip
Question: What are the limitations of base LLMs that RAG solves?
Answer
Tip
Question: What is RAG?
Answer
Tip
Question: How to obtain accurate semantic representations?
Answer
Tip
Question: How to align query/document semantic spaces?
Answer
Tip
Question: How to match retrieval model output with LLM preferences?
Answer
Tip
Question: How to improve results via post-retrieval processing?
Answer
Tip
Question: How to optimize generator adaptation to inputs?
Answer
Tip
Question: What are the benefits of using RAG?
Answer
Tip
RAG Layout Analysis
Tip
Question: Why is PDF parsing necessary?
Answer
Tip
Question: What are common methods and their differences?
Answer
Tip
Question: What problems exist in PDF parsing?
Answer
Tip
Question: Why is table recognition important?
Answer
Tip
Question: What are the main methods?
Answer
Tip
Question: Traditional methods
Answer
Tip
Question: pdfplumber extraction techniques
Answer
Tip
Question: Why do we need text chunking?
Answer
Tip
Question: What are common chunking strategies (regex, Spacy, LangChain, etc.)?
Answer
Tip
RAG Retrieval Strategies
Tip
Question: Why use LLMs to assist recall?
Answer
Tip
Question: HYDE approach: idea and issues
Answer
Tip
Question: FLARE approach: idea and recall strategies
Answer
Tip
Question: Why construct hard negative samples?
Answer
Tip
Question: Random sampling vs. Top-K hard negative sampling
Answer
Tip
RAG Evaluation
Tip
Question: Why evaluate RAG?
Answer
Tip
Question: What are the evaluation methods, metrics, and frameworks?
Answer
Tip
RAG Optimization
Tip
Question: What are the optimization strategies for retrieval and generation modules?
Answer
Tip
Question: How to enhance context using knowledge graphs (KGs)?
Answer
Tip
Question: What are the problems with vector-based context augmentation?
Answer
Tip
Question: How can KG-based methods improve it?
Answer
Tip
Question: What are the main pain points in RAG and their solutions?
Answer
Tip
Question: Content missing
Answer
Tip
Question: Top-ranked docs missed
Answer
Tip
Question: Context loss
Answer
Tip
Question: Failure to extract answers
Answer
Tip
Question: Explain RAG-Fusion. Why it’s needed,Core technologies,Workflow, and Advantages
Answer
Tip
Question
Answer
Tip
Graph RAG
Tip
Question: Why do we need Graph RAG?
Answer
Tip
Question: What is Graph RAG and how does it work? Show a code example and use case.
Answer
Tip
Question: How to improve ranking optimization in Graph RAG?
Answer
Tip
- Parameter-Efficient Fine-Tuning (PEFT)
PEFT Fundamentals
Tip
Question: What is fine-tuning, and how is it performed?
Answer
Tip
Question: Why do we need PEFT?
Answer
Tip
Question: What is PEFT and its advantages?
Answer
Tip
Adapter Tuning
Tip
Question: Why use adapter-tuning?
Answer
Tip
Question: What’s the core idea behind adapter-tuning?
Answer
Tip
Question: How does it differ from full fine-tuning?
Answer
Tip
::::