Training LLM From Zero
Objective
The goal of this project is to design, implement, and train a small-scale Large Language Model (LLM) from scratch, progressing through the full training lifecycle:
- Pre-training on large-scale unlabeled text.
- Supervised Fine-Tuning (SFT) on high-quality instruction-following datasets.
- Parameter-Efficient Fine-Tuning (LoRA) for resource-efficient adaptation.
- Direct Preference Optimization (DPO) for aligning the model with human preferences.
The project aims to serve as a practical, hands-on implementation of LLM training concepts from recent research.
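As a preview of the final stage, the sketch below shows one way the DPO objective can be written in PyTorch. It is a minimal illustration, not the project's actual training code; the tensors are assumed to hold per-sequence log-probabilities (summed over response tokens) under the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs."""
    # Implicit rewards: how far the policy has moved away from the reference
    # model on the chosen and rejected responses, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen response to a higher implicit reward than the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Because the objective only needs log-probabilities from the policy and a frozen reference model, no separate reward model or RL loop is required.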
Environment Setup
- macOS with M Series chip (‼️ MPS is not optimized for training)

  Testing on: macOS MPS device (M4, 64 GB RAM), PyTorch version: 2.3.0, MPS available: True

  Matrix 1024x1024: 10.40 TFLOPS | Time: 20.65ms
  Matrix 2048x2048: 13.45 TFLOPS | Time: 127.76ms
  Matrix 4096x4096: 13.49 TFLOPS | Time: 1018.53ms
  Matrix 8192x8192: 12.82 TFLOPS | Time: 8573.45ms
  Matrix 16384x16384: 9.37 TFLOPS | Time: 93871.68ms

- Windows with CUDA (recommended)

  Testing on: CUDA device (2080 Ti, 11 GB memory), PyTorch version: 2.8.0+cu129, CUDA available: True

  Matrix 1024x1024: 65.62 TFLOPS | Time: 3.27ms
  Matrix 2048x2048: 634.46 TFLOPS | Time: 2.71ms
  Matrix 4096x4096: 4447.00 TFLOPS | Time: 3.09ms
  Matrix 8192x8192: 34163.30 TFLOPS | Time: 3.22ms
  Matrix 16384x16384: 199933.93 TFLOPS | Time: 4.40ms
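The numbers above come from a simple matrix-multiplication benchmark. The original script is not included here; the sketch below is one way to reproduce the same output format (the device selection and the 2 * n^3 FLOP estimate are assumptions).

```python
import time
import torch

def benchmark_matmul(device: str, sizes=(1024, 2048, 4096, 8192, 16384), repeats: int = 10):
    """Print a rough TFLOPS estimate for square matmuls on the given device."""
    for n in sizes:
        a = torch.randn(n, n, device=device)
        b = torch.randn(n, n, device=device)
        torch.matmul(a, b)  # warm-up so lazy initialization does not skew the timing
        if device == "cuda":
            torch.cuda.synchronize()
        elif device == "mps":
            torch.mps.synchronize()
        start = time.perf_counter()
        for _ in range(repeats):
            torch.matmul(a, b)
        if device == "cuda":
            torch.cuda.synchronize()
        elif device == "mps":
            torch.mps.synchronize()
        elapsed = (time.perf_counter() - start) / repeats
        tflops = 2 * n ** 3 / elapsed / 1e12  # one n x n matmul is ~2 * n^3 FLOPs
        print(f"Matrix {n}x{n}: {tflops:.2f} TFLOPS | Time: {elapsed * 1e3:.2f}ms")

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
    print(f"PyTorch version: {torch.__version__}")
    benchmark_matmul(device)
```

Note that GPU kernel launches are asynchronous, so the device must be synchronized before reading the timer; otherwise GPU timings can appear implausibly fast.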
- Python packages for CUDA support
  - torch (PyTorch) ‼️ Be careful: the CUDA version must match the PyTorch build (12.6, 12.8, or 12.9); a quick check is sketched after this list
    `pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu129`
  - transformers
    `pip install transformers`
    or, for the CUDA Transformer Engine:
    `pip3 install --no-build-isolation transformer_engine[pytorch]`
  - peft
    `pip install peft`
  - cuda toolkit
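To confirm that the installed wheel matches the local CUDA setup, a short sanity check such as the following can be run (the values in the comments are what the cu129 wheel above would be expected to report):

```python
import torch

print(f"PyTorch version: {torch.__version__}")         # e.g. 2.8.0+cu129
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version (build): {torch.version.cuda}")   # e.g. 12.9
    print(f"Device: {torch.cuda.get_device_name(0)}")
print(f"MPS available: {torch.backends.mps.is_available()}")
```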
Dataset
- tokenizer dataset
- pre-training dataset
- sft (Supervised Fine-Tuning) dataset
- dpo (Direct Preference Optimization) dataset
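The exact schemas depend on which datasets are used; as an illustration, records for the pre-training, SFT, and DPO stages often look roughly like the following (field names here are assumptions, not a format mandated by the project). The tokenizer dataset is typically the same kind of raw text as the pre-training corpus.

```python
# Illustrative record shapes for each stage; field names are assumptions.

pretrain_record = {
    "text": "A raw passage of unlabeled text used for next-token prediction."
}

sft_record = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models are trained in several stages ...",
    "output": "A concise, high-quality reference answer."
}

dpo_record = {
    "prompt": "Explain what LoRA is in one sentence.",
    "chosen": "LoRA freezes the pretrained weights and learns small low-rank update matrices.",
    "rejected": "LoRA is a programming language."
}
```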