
Steps required to make your own on-prem “small” LLM
Sounds daunting, does it not?
Most people will tell you it is impossible unless you operate at the scale of Google, Microsoft, Meta or OpenAI. That is true if you aim to compete directly with their frontier models, but as of late 2025 it is relatively affordable to create your own Small Language Model (SLM). It will be smaller than massive proprietary models like GPT-5 (which likely utilise Mixture of Experts architectures exceeding a trillion parameters), but building a targeted model in the 1 to 8 billion parameter range is entirely feasible on high-end consumer hardware.
You have likely heard the requirement for multiple high-end GPUs, such as NVIDIA H100s, alongside massive RAM and top-tier CPUs.
Plus data. Massive amounts of it.
In reality, the barrier to entry has collapsed. The introduction of massive Unified Memory in Apple Silicon, high-VRAM consumer GPUs, and the open-source code explosion on GitHub has democratised the ability to train—not just run—these models.
Here is the step-by-step technical reality for training a model ab initio or performing deep fine-tuning on consumer silicon in 2026.
A. Defining the Model Architecture
Choose a Model Type: The Transformer architecture remains the standard. For 2026 workflows, you would likely look at Decoder-only architectures (similar to Llama or Mistral). These are optimised for generative tasks (predicting the next word) rather than understanding tasks (Encoders like BERT).
Design the Architecture: Your hardware choice dictates your model ceiling.
- The Constraint: If you run on NVIDIA or AMD, you are constrained by Video RAM (fitting the model layers). If you run on Apple, you are constrained by compute speed (how long training takes), but you have massive memory capacity.
- The Code: You do not typically write this from scratch. You go to GitHub. Repositories like facebookresearch/llama or mistralai/mistral-src contain the exact architecture code (including the inference and model Python files), which you can clone and modify.
B. Data (Gathering, Preprocessing and Tokenisation)
Data Collection: You need a high-quality corpus. While web scraping (Common Crawl) is standard, the 2026 trend relies heavily on Synthetic Data. This involves using a larger “teacher” model (like GPT-4o or Llama 3.3) to generate high-quality, textbook-style data to train your smaller model.
GitHub Tools for Data:
- Scraping: Use adbar/trafilatura (on GitHub) to scrape web text efficiently without the HTML junk.
- Cleaning: Use huggingface/datatrove (on GitHub) for deduplication and filtering.
Tokenisation: This is often overlooked but critical. You cannot just feed text to a chip; it needs numbers.
- How it works: You must train a Tokeniser (usually using Byte-Pair Encoding, or BPE) on your specific dataset. It breaks words down into efficient sub-word units (tokens).
- Tool: The standard library is huggingface/tokenizers on GitHub (Rust-based, extremely fast).
C. Setting Up the Computing Infrastructure
This is where the landscape has shifted most dramatically. You do not need a server farm; you need the right workstation.
Option 1: The High-Bandwidth CUDA Path (NVIDIA)
- Hardware: 2x or 3x NVIDIA GeForce RTX 5090s. With 32GB of GDDR7 Video RAM per card (based on late 2025 specs), a dual-card setup gives you 64GB of Video RAM.
- The Power Reality: A single 5090 can draw roughly 600 Watts. Three of them plus a CPU will pull over 2200 Watts. A standard 15 Amp wall socket cannot handle this. You have two options:
  - Power Limiting: Use software tools to clamp power draw to 70% (approx 420 Watts per card). You lose about 5% performance but gain stability.
  - Dual Power Supplies: You physically require two Power Supply Units (e.g., one 1600 Watt and one 1000 Watt) and likely a dedicated 20 Amp circuit in your wall.
- Software: You will use the standard CUDA stack with PyTorch. Techniques like FlashAttention-3 (found at Dao-AILab/flash-attention on GitHub) are critical here to maximise the limited Video RAM.
Option 2: The Unified Memory Path (Apple Silicon)
- Hardware: Apple Mac Studio with M3 Ultra, configured with the massive 512GB Unified Memory option.
- Why it works: Unlike PCs, which split system RAM and Video RAM, the M3 Ultra allows the GPU to access the entire 512GB pool. You can load a full, unquantised 100 Billion parameter model and its training states entirely into memory.
- Software: You will use MLX (ml-explore/mlx on GitHub), Apple’s array framework designed for Apple Silicon. It allows efficient model training directly on the Metal backend without the overhead of a CUDA translation layer.
Option 3: The Cost-Effective ROCm Path (AMD)
- Hardware: 2x or 3x AMD Radeon RX 7900 XTX (or the 9000 series equivalent). With 24GB Video RAM per card, it remains the value leader.
- Why it works: AMD’s ROCm (Radeon Open Compute) platform has matured significantly.
- Software: You run PyTorch natively on Linux using the HIP runtime. The ROCm/pytorch repository on GitHub is your starting point.
D. Model Pre-training & Training Tools
Initialisation: Start with random initialisation of parameters (weights and biases).
The GitHub Ecosystem for Training: You rarely write raw PyTorch loops anymore. You clone specific “training harnesses” from GitHub:
- Unsloth (unslothai/unsloth): The gold standard for consumer GPUs in 2025/26. It hand-optimises the backpropagation steps, making training 2-5x faster while using up to 70% less memory than standard PyTorch. Essential for single or dual GPU setups.
- Axolotl (OpenAccess-AI-Collective/axolotl): A config-based trainer. You do not write code; you write a YAML file (specifying data path, learning rate, model type) and run it. It handles multi-GPU orchestration for you.
- Torchtune (pytorch/torchtune): PyTorch’s native library for easily fine-tuning LLMs.
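To make the YAML-driven workflow concrete, here is an illustrative Axolotl-style config. The key names follow the project’s published examples, but treat every value (model name, paths, hyperparameters) as a placeholder and check the repository docs for the current schema:

```yaml
# Illustrative Axolotl-style config — values are placeholders, not a recipe.
base_model: mistralai/Mistral-7B-v0.1
datasets:
  - path: ./data/train.jsonl
    type: completion
sequence_len: 4096
micro_batch_size: 4
gradient_accumulation_steps: 8
num_epochs: 1
learning_rate: 0.0002
bf16: true
output_dir: ./outputs/my-slm
```

You then point the Axolotl CLI at this file and it handles the training loop and multi-GPU setup.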
Optimisation:
- Efficiency: On consumer cards, you will use Mixed Precision Training (BF16, or FP8 where supported). This roughly halves memory usage compared to standard FP32 without losing accuracy.
- Timeframe: Training a 3 Billion parameter model on 1 trillion tokens with dual 5090s will take approximately 1-2 weeks. On an M3 Ultra, expect 3-4 weeks due to lower raw compute throughput (TFLOPS) compared to dedicated GPUs.
E. Fine-Tuning for Specific Tasks (Optional)
If you do not build from scratch, you Fine-Tune.
- LoRA (Low-Rank Adaptation): Instead of retraining the whole brain (all weights), you freeze the model and only train a tiny “adapter” layer.
- QLoRA (Quantised LoRA): You compress the main model to 4-bit (making it tiny) and train the adapter on top. This allows you to fine-tune a massive 70 Billion parameter model on a single 24GB card. Use the bitsandbytes library from GitHub for the 4-bit quantisation.
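The arithmetic behind LoRA can be sketched in a few lines of plain Python. The sizes here are toy assumptions (a 4x4 layer with a rank-1 adapter); in a real model the adapters attach to each attention projection:

```python
# Toy illustration of the LoRA idea (hypothetical sizes, pure Python).
# The frozen weight W (d_out x d_in) is never updated; only the two small
# factors B (d_out x r) and A (r x d_in) are trained. The effective weight
# is W + (alpha / r) * (B @ A).

def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, leaving the frozen W untouched."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# A 4x4 frozen layer: 16 frozen weights, but only 8 trainable numbers
# (B: 4x1, A: 1x4) — that asymmetry is the entire memory saving.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [0.0], [0.0]]   # trainable
A = [[0.0, 2.0, 0.0, 0.0]]         # trainable
W_eff = lora_effective_weight(W, A, B, alpha=2, r=1)
print(W_eff[0])  # first row now carries the adapter's contribution
```

At 3B scale the same ratio holds: the adapters are typically well under 1% of the frozen weights.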
F. Evaluation and Testing
Performance Metrics: Do not just trust the “loss curve”.
LLM-as-a-Judge: The 2026 standard is to use a superior model (like GPT-4) to grade your model’s answers. You feed your model 100 questions, get answers, and ask GPT-4 “On a scale of 1-10, how accurate is this?”. This correlates highly with human evaluation. Use LM Evaluation Harness (EleutherAI/lm-evaluation-harness on GitHub) to automate testing against benchmarks like MMLU.
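The grading step itself is mostly string assembly. A minimal sketch of the judge prompt, assuming the rubric wording is yours to choose; the actual call to the judge model is left to whichever API client you use:

```python
# Sketch of an LLM-as-a-Judge prompt. The rubric wording is an assumption;
# send the returned string to your judge model of choice.

def build_judge_prompt(question: str, gold_answer: str, model_answer: str) -> str:
    """Assemble a grading prompt for a stronger 'judge' model."""
    return (
        "You are grading a smaller model's answer.\n"
        f"Question: {question}\n"
        f"Reference (gold) answer: {gold_answer}\n"
        f"Candidate answer: {model_answer}\n"
        "On a scale of 1-10, how accurate is the candidate answer? "
        "Reply with a single integer."
    )

prompt = build_judge_prompt(
    "What is the boiling point of water at sea level?",
    "100 degrees Celsius.",
    "It boils at 100 C.",
)
print(prompt)
```

Looping this over your 100 test questions and averaging the returned integers gives you the score described above.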
G. Deployment
Integration:
- NVIDIA/AMD: Export to ONNX format or keep as SafeTensors for inference engines like vLLM (vllm-project/vllm on GitHub), currently the fastest engine for production.
- Apple: Convert to GGUF format, the standard for local runners. The tool llama.cpp (ggerganov/llama.cpp on GitHub) allows you to quantise and run these models efficiently on a CPU/GPU mix.
H. Maintenance and Updates
RAG (Retrieval-Augmented Generation): Instead of retraining the model every time facts change, you connect it to a database (Vector Database). When you ask a question, it searches your documents, finds the answer, and the LLM summarises it. You can find RAG pipelines like LangChain or LlamaIndex on GitHub.
Reference Guide: Acronyms & GitHub Resources
Essential Acronyms:
- LLM / SLM: Large / Small Language Model.
- VRAM: Video Random Access Memory. The “working memory” of your GPU.
- BPE: Byte-Pair Encoding. The method used to chop text into numbers (tokens).
- CUDA: Compute Unified Device Architecture. NVIDIA’s GPU computing platform.
- ROCm: Radeon Open Compute. AMD’s equivalent of CUDA.
- FP8 / BF16: Floating Point 8 / Brain Float 16. Reduced-precision number formats.
- LoRA: Low-Rank Adaptation.
- GGUF: GPT-Generated Unified Format.
The GitHub “Must-Star” List:
- For Training (Efficiency): unslothai/unsloth
- For Training (Orchestration): OpenAccess-AI-Collective/axolotl
- For Inference (Apple/CPU): ggerganov/llama.cpp
- For Inference (NVIDIA Production): vllm-project/vllm
- For User Interface: oobabooga/text-generation-webui or open-webui/open-webui
- For Data Processing: huggingface/datatrove
Where to get Data & Models:
- Hugging Face: The “GitHub of AI Data”. Go here for:
  - Datasets: HuggingFaceFW/fineweb (massive web data), OpenAssistant/oasst2 (chat data).
  - Models: Llama, Mistral, Qwen (open-weights models).
- arXiv.org: The source for the latest technical papers.
Defining the LLM Model Architecture
Deciding on the design of a Transformer architecture for a Small Language Model (SLM) or Large Language Model (LLM) involves several key considerations.
These decisions impact the model’s ability to learn from data, its computational efficiency, and ultimately, its performance on various tasks. It depends entirely on what you want to develop the LLM for.
Here is a more detailed look at how you might decide on the design of the architecture in late 2025:
Determine Model Size
Layers: The number of decoder layers directly affects the model’s depth and its capacity for understanding complex reasoning. However, in 2026, we avoid the “thin and deep” structures of the past. Deep models are slow to run because each layer must be processed in order. Today, we prefer “wider” models with fewer layers to maximise parallelisation on GPUs. For a 2 to 3 billion parameter model, a target of 18 to 32 layers is standard.
Hidden Units: The size of the hidden layers (dimensionality) influences the model’s ability to represent information internally. Larger hidden layers offer more capacity but increase computational load and memory usage (VRAM). Standard sizes for this class of model are 2048, 2560, or 3072.
Attention Heads: The number of attention heads affects the model’s ability to focus on different parts of the input sequence simultaneously. In 2026, you should look at Grouped Query Attention (GQA). This reduces the memory footprint of the system, allowing your model to handle longer documents without crashing your consumer-grade hardware.
To simplify what we just described, let’s consider a Llama-style model with 3 billion parameters, where most parameters come from the transformer layers themselves. Each transformer layer consists of two main components: a self-attention mechanism and a position-wise feed-forward network. The number of parameters in each layer is influenced by several factors:
- Size of the Model’s Hidden Layers (d_model): The width of the model.
- Number of Attention Heads: How many parallel processes scan the text.
- Feed-Forward Network Size: Modern models use SwiGLU activation functions, which require three linear projections rather than two, making the layers “denser” than older models like GPT-2.
Given these variables, the formula for calculating the approximate number of parameters in a single transformer layer involves summing the weights in the attention block and the feed-forward block.
Simplified Calculation
Assuming a standard configuration for a 3 Billion parameter model:
- A hidden size (d_model) of 3072
- A feed-forward intermediate size of roughly 8192 (due to SwiGLU scaling)
- 24 to 32 attention heads
A rough estimate of parameters per layer can be calculated. In this modern configuration, a single layer holds approximately 100 million parameters. This accounts for the attention projections and the massive feed-forward blocks used in 2026 architectures.
Given this estimate, a 3 billion parameter model would need approximately 28 to 30 layers.
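The estimate above can be sanity-checked in a few lines of Python, assuming a Llama-style layer with Grouped Query Attention. The 8 KV-head figure is an assumption borrowed from similar open models, and real designs add small extra terms (norms, biases):

```python
# Back-of-the-envelope parameter count for one Llama-style decoder layer,
# matching the configuration above (d_model 3072, SwiGLU FFN of 8192,
# Grouped Query Attention). n_kv_heads=8 is an illustrative assumption.

def layer_params(d_model: int, d_ff: int, n_heads: int, n_kv_heads: int) -> int:
    head_dim = d_model // n_heads
    # Attention: Q and output projections are full width; K and V are
    # shrunk by Grouped Query Attention.
    attn = 2 * d_model * d_model + 2 * d_model * (n_kv_heads * head_dim)
    # SwiGLU feed-forward: three projections (gate, up, down).
    ffn = 3 * d_model * d_ff
    return attn + ffn

per_layer = layer_params(d_model=3072, d_ff=8192, n_heads=24, n_kv_heads=8)
print(f"{per_layer / 1e6:.0f}M parameters per layer")  # roughly 100M, as stated

# Dividing the 3B budget by the per-layer cost lands near 30 layers;
# subtracting a 128k-vocabulary embedding table (~0.4B) pulls it down a few.
print(f"{3e9 / per_layer:.0f} layers ignoring embeddings")
print(f"{(3e9 - 128_000 * 3072) / per_layer:.0f} layers after embeddings")
```

Note how the feed-forward block dominates: roughly three-quarters of each layer’s weights sit in the SwiGLU projections.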
A Reality Check
In practice, the actual number of layers for a 3 billion parameter model will differ slightly based on efficiency needs:
- Model Architecture Variations: Different implementations adjust the hidden size or the feed-forward multiplier. Reducing the hidden size allows you to add more layers for deeper reasoning, while increasing it favours faster, more parallel processing.
- Vocab Size: Modern models use massive vocabularies (often 128,000 tokens). The “embedding layer” (the dictionary the model uses) can itself take up 400 to 500 million parameters, which counts towards your total.
Initial Conclusion
While the rough calculation suggests around 30 layers for a 3 billion parameter model, the actual design involves a careful balance. We do not use 200 layers for small models anymore, as it is computationally inefficient on GPUs.
For real-world applications, examining existing models of similar sizes (like Llama 3.2 or Mistral) gives you the best “recipe” to follow.
That said, you can start with a scaled-down version (e.g., 16 layers) to test your pipeline before committing weeks of GPU time to the full training run.
Data Gathering and Preprocessing
Data gathering in late 2025 is no longer about scraping the internet blindly; it is about curation and synthesis. We have moved past the era of “more data is better” to “better data is better”.
Here is the updated, technically precise workflow for Data Gathering and Pre-processing in the 2026 era.
1. Data Collection: The Shift to Quality and Synthesis
Utilising “Platinum” Open Datasets: Forget raw Wikipedia dumps. In 2026, we rely on heavily curated datasets like FineWeb-Edu (Hugging Face) or RedPajama v2. These are subsets of the internet that have been filtered for educational value, removing the “slop” (SEO spam, low-quality forums) that plagued early models.
- Synthetic Data (The Game Changer): For specialised domains (like coding or medical), we now use Synthetic Data Generation. This involves using a frontier model (like GPT-4o or Llama 3.3) to write textbook-quality examples for your smaller model to learn from. It is cleaner, denser, and more effective than human-written text for training Small Language Models (SLMs).
Selecting Datasets: Align data with your specific downstream task.
- Multilingual: Use datasets like CulturaX, which clean and combine data from dozens of languages.
- Domain Specific: Do not just find “medical text”. Generate synthetic patient notes or scrubbed clinical trials using a teacher model to ensure privacy and accuracy.
Combining Sources (Data Mixing): This is now an exact science called “Data Mixing”. You do not just dump files together. You typically train on a ratio, for example: 50% General Knowledge (FineWeb), 30% Domain Specific (Your Data), and 20% Coding/Math (to improve reasoning capabilities).
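A mixing ratio like the one above is usually implemented as weighted sampling over your sources. A minimal sketch with the standard library; the source names and ratios are illustrative, not a recommendation for your corpus:

```python
# Sketch of "data mixing": sample training documents from several sources
# according to fixed ratios. Names and weights are illustrative only.
import random

def make_sampler(sources: dict, seed: int = 0):
    """Return a function that picks a source name according to the ratios."""
    rng = random.Random(seed)
    names = list(sources)
    weights = [sources[n] for n in names]
    return lambda: rng.choices(names, weights=weights, k=1)[0]

mix = {"general_fineweb": 0.5, "domain_internal": 0.3, "code_math": 0.2}
pick = make_sampler(mix, seed=42)

counts = {name: 0 for name in mix}
for _ in range(10_000):
    counts[pick()] += 1
print(counts)  # proportions land close to the 50/30/20 target
```

In a real pipeline each `pick()` would select which source’s iterator to draw the next document from.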
2. Data Cleaning: Automated Pipelines
Pre-Curated Datasets: As noted, datasets on Hugging Face (look for the “verified” tag or high download counts) are usually arrow or parquet files ready for ingestion. They are already deduplicated and sanitised.
Custom Data Cleaning (The “Janitor” Work): If you bring your own internal data (documents, emails, logs), you must clean it strictly. In 2026, we use automated pipelines like DataTrove or NeMo Curator:
- Deduplication (MinHash): Essential. If your model sees the same marketing boilerplate 50 times, it will overfit. We use “Fuzzy Deduplication” (MinHash) to remove near-duplicates.
- PII Redaction: Automated Named Entity Recognition (NER) to strip names, emails, and phone numbers before training.
- Heuristic Filtering: Removing lines that are too short, have excessive symbol ratios (code vomit), or likely contain toxicity.
3. Tokenisation: The Bridge to Numbers
Choosing a Strategy: It is almost exclusively Byte-Pair Encoding (BPE) in 2026. We use libraries like Tiktoken (from OpenAI) or HuggingFace Tokenizers (Rust-based).
- Vocabulary Size: The trend has shifted to larger vocabularies. While GPT-4 used ~100k, newer SLMs often use 128k to 150k tokens. This compresses text better, meaning the model processes more information per second.
The Process:
- Train the Tokeniser: You feed a sample of your specific data to the tokeniser so it learns your domain’s jargon (e.g., learning that “cholecystectomy” is one token, not five).
- Encoding: Convert your cleaned text into integer sequences.
- Packing: Concatenate these sequences into fixed-length chunks (e.g., 4096 or 8192 tokens) to maximise GPU efficiency during training.
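The packing step above is simple enough to show in full. The token IDs below are dummies, and real pipelines separate documents with an end-of-sequence token exactly as sketched here:

```python
# Sketch of "packing": concatenate tokenised documents (with EOS
# separators) and slice the stream into fixed-length training chunks.

def pack_sequences(docs, chunk_len: int, eos_id: int = 0):
    """Concatenate docs with EOS markers and cut into equal-length chunks."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)
    # Drop the ragged tail so every chunk is exactly chunk_len tokens.
    n_chunks = len(stream) // chunk_len
    return [stream[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]

docs = [[101, 102, 103], [201, 202], [301, 302, 303, 304]]
chunks = pack_sequences(docs, chunk_len=4)
print(chunks)  # [[101, 102, 103, 0], [201, 202, 0, 301], [302, 303, 304, 0]]
```

Note that documents can straddle chunk boundaries; the EOS token tells the model where one ends and the next begins.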
4. Special Considerations for Local LLMs
- Data Licensing: The “wild west” is over. Ensure your open datasets have permissive licenses (Apache 2.0 or MIT) if you intend to use the model commercially.
- Knowledge Cut-off: Your model only knows what it was trained on. Do not expect it to know today’s news unless you implement RAG (Retrieval-Augmented Generation) later.
__________________________
Deep Dive: Preparing Your Own Data (Internal / On-Prem)
If you are mixing your proprietary company data with open datasets, follow this strict protocol.
i. Data Cleaning (Internal)
- Normalisation: Convert fancy quotes to straight quotes, fix encoding errors (UTF-8 only), and standardise whitespace.
- Artifact Removal: Remove “Click here to unsubscribe” footers, legal disclaimers, and HTML remnants.
ii. Data Structuring & Formats
- JSONL (JSON Lines): The standard for text data. Each line is a valid JSON object. It is easy to stream and does not require loading the whole file into RAM.
{"text": "This is a training example."}
{"text": "This is another one."}
- Parquet: The standard for large datasets. It is a column-oriented binary format that compresses heavily and loads instantly. Do not use CSV for training data; it is slow and handles newlines poorly.
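The JSONL pattern needs nothing beyond the standard library. A minimal round trip, writing to a temporary file for illustration:

```python
# Minimal JSONL round trip: write one JSON object per line, then stream
# it back without ever loading the whole file into RAM.
import json
import os
import tempfile

records = [{"text": "This is a training example."},
           {"text": "This is another one."}]

path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Reading line by line keeps memory flat no matter how large the file is.
loaded = []
with open(path, encoding="utf-8") as f:
    for line in f:
        loaded.append(json.loads(line))
print(loaded[1]["text"])  # This is another one.
```

The same streaming loop scales to multi-gigabyte training files, which is exactly why JSONL beats a single giant JSON array.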
iii. Vector Databases (For RAG, not Training)
Hosted services like Pinecone get expensive. For an on-prem 2026 setup, you have superior local options:
- ChromaDB: Open-source, runs locally, purely Python/Rust based. Excellent for small-to-medium datasets.
- pgvector: If you already use PostgreSQL, this extension turns your existing database into a vector store. It is fast, free, and sits on your own server.
- Qdrant: A high-performance vector search engine written in Rust. It runs as a Docker container on your local machine and is incredibly efficient.
iv. Data Storage
- NVMe is Mandatory: Do not store training data on mechanical HDDs (spinners). The GPU needs data faster than HDDs can provide it. You need a fast NVMe SSD (PCIe Gen 4 or 5) to keep the “data loader” from becoming a bottleneck.
Summary: Don’t reinvent the wheel. Use FineWeb for general knowledge, generate Synthetic Data for specific skills, save everything as Parquet, and use Tiktoken to turn it into numbers.
Tokenisation
Tokenisation is not just a preprocessing step; it is the fundamental translation layer between human language and machine computation. It involves converting raw text into numerical sequences (IDs) that the model’s mathematical operations can process. In late 2025, the debate on “how” to tokenise has largely been settled.
Here is the technical reality of tokenisation for modern Large Language Models (LLMs) like Llama 3, Mistral, and GPT-4o.
a) Choosing a Tokenisation Method
In the past, we considered word-based or character-based methods. For LLMs in 2026, those are effectively obsolete. The industry standard is exclusively Subword Tokenisation, specifically Byte-Pair Encoding (BPE).
- Why Word-based failed: The English language (and others) is too vast. A word-based model needs a vocabulary of millions to function, which is computationally inefficient.
- Why Character-based failed: It loses semantic meaning. Predicting the letter ‘h’ after ‘t’ is easy; predicting the concept of “photosynthesis” from raw characters is computationally expensive for the model.
- The Solution (BPE): We split text into “subwords”. Common words like “apple” are single tokens. Complex or rare words like “antidisestablishmentarianism” are broken into smaller, meaningful chunks (anti, dis, establish, ment, arian, ism). This keeps the vocabulary efficient (usually 128,000 to 150,000 tokens) while retaining the ability to process any string of text.
b) Implementing the Tokenisation Process
You do not write a tokeniser script manually. You “train” a tokeniser on your dataset.
- Preprocessing: We clean the text, but less aggressively than before. We often want the model to understand code, Markdown, and even messy user inputs. Unicode standardisation (NFC normalisation) is applied to ensure characters (like accents) are consistent.
- Training the Tokeniser: You feed your specific corpus (e.g., your medical data or internal docs) into the tokeniser algorithm. The algorithm starts with all individual characters, then iteratively merges the most frequently adjacent pair (e.g., ‘t’ and ‘h’ become ‘th’). It repeats this until it reaches your target vocabulary size (e.g., 128k).
- Result: A JSON file that contains the specific dictionary (vocab) and merge rules your model will use.
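The merge loop is easier to grasp on a toy corpus. This is a pure-Python sketch of the BPE training procedure described above; real tokenisers (huggingface/tokenizers, tiktoken) do the same thing in Rust, at byte level, over gigabytes of text:

```python
# Toy BPE training loop: repeatedly find and merge the most frequent
# adjacent pair of symbols across a (tiny) word-frequency corpus.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Corpus: word -> frequency, each word starting as a tuple of characters.
words = {tuple("the"): 5, tuple("then"): 2, tuple("this"): 3}
merges = []
for _ in range(3):  # a real run performs ~128k merges, not 3
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)
print(merges)  # first merge is ('t', 'h'), then ('th', 'e'), ...
```

The recorded merge list is exactly the “merge rules” file mentioned above; applying the merges in order to new text reproduces the tokenisation.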
c) Special Tokens & Chat Templates
In 2026, “Special Tokens” have evolved into complex Chat Templates.
- Structural Tokens: Standard markers used to indicate the End of a Sequence or Padding (to make batches of data the same length).
- Control Tokens: Crucial for “Instruct” or “Chat” models. You effectively program the model’s behaviour using role-specific tokens: invisible markers that tell the model when the “User” is speaking, when the “System” is setting the rules, and when the “Assistant” should reply. These are not just separators; they trigger specific behavioural modes in the model’s weights.
d) Converting Tokens to IDs
The Lookup: Once trained, the tokeniser acts as a deterministic hash map.
Input: “Hello world”
Process: The tokeniser breaks this into “Hello” (Token 15496) and “ world” (Token 995).
Output: [15496, 995]
This vector of integers is the actual input fed into the GPU.
e) Handling Out-of-Vocabulary (OOV) Words
In older NLP, if a word wasn’t in the list, the model saw an “Unknown” marker and often failed or hallucinated.
The 2026 Reality: OOV is virtually non-existent. Modern BPE works on UTF-8 Bytes, not just characters. If a user types a word the model has never seen (or even a random string of emojis), the tokeniser falls back to the byte level, breaking the word down into individual byte tokens. The model can process anything you type, even if it has to read it letter-by-letter.
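What the byte-level fallback means in practice can be shown with the standard library alone: any string, however unfamiliar, reduces to a sequence of UTF-8 byte values (0-255), so there is no “unknown token” failure mode left:

```python
# Byte-level fallback in miniature: an out-of-vocabulary word is simply
# represented as its raw UTF-8 byte values (each guaranteed in 0..255).

def byte_fallback(word: str):
    """Represent an unseen word as raw UTF-8 byte values."""
    return list(word.encode("utf-8"))

print(byte_fallback("zxqv"))   # plain ASCII: one byte per character
print(byte_fallback("🦙"))     # an emoji expands to four UTF-8 bytes
```

A real byte-level BPE reserves 256 base tokens for exactly these values, so the model reads such input “letter-by-letter”, as described above.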
f) Tools and Libraries
Forget NLTK or spaCy for this task. They are excellent for linguistic analysis but are not used for training LLMs.
- Hugging Face Tokenizers: The industry standard. Written in Rust for extreme speed, with Python bindings. It handles BPE training and encoding for 99% of open-source models.
- Tiktoken (OpenAI): The specific BPE implementation used by GPT-4 and its successors. It is faster than Hugging Face’s implementation but less flexible. It is the go-to if you are using OpenAI APIs or cloning their specific architecture.
- SentencePiece (Google): A C++ library often used for multilingual models because it treats the input as a raw stream of characters, independent of spaces (useful for languages like Chinese or Japanese that do not use spaces).
Summary:
Tokenisation defines the “DNA” of your model. If you change the tokeniser, you must retrain the model from scratch. Therefore, selecting a robust, byte-level BPE tokeniser with a large vocabulary (128k+) is the first critical technical decision in your pipeline.
Hardware. Computer Infrastructure. Pricing
To train a 3 billion parameter LLM on-premises in early 2026 within a strict “single power source” limit (max 2200W on a 10A household circuit), we must be extremely precise with our hardware selection. You cannot simply stack 4 GPUs and hope the breaker holds.
The choice between NVIDIA, AMD, and Apple Silicon defines not just your software stack, but whether you trip your house’s safety switch during a training run.
Hardware Setup: The Three Paths
Option 1: NVIDIA RTX 5090 (The Speed King) The RTX 5090 (released early 2025) is the new standard. With 32GB of GDDR7 VRAM per card, a dual-card setup offers 64GB of ultra-fast memory.
- Power Draw: Each card draws roughly 500W-550W.
- The Math: 2 cards (1100W) + CPU/system (450W) = ~1550W.
- Verdict: Safe. You can run two of these on a single standard Australian 10A socket (which peaks at 2400W). Do not attempt three.
Option 2: AMD Radeon RX 7900 XTX (The Value & Capacity Leader) This remains the smart budget play. It lacks CUDA but offers 24GB VRAM per card for roughly half the price of a 5090.
- Power Draw: Each card draws roughly 355W.
- The Math: 3 cards (1065W) + CPU/system (450W) = ~1515W.
- Verdict: Safe. You can run three of these on a single plug, giving you 72GB of VRAM. This is more capacity than the NVIDIA build, allowing for larger batch sizes. (Note: 4 cards would push ~1900W, which is dangerously close to the continuous load limit of a 10A breaker.)
Option 3: Apple Mac Studio M3 Ultra (The Silent Specialist) This is the game changer. In a PC, System RAM and Video RAM are separate. On Apple Silicon, they are one “Unified” pool. A Mac Studio with 128GB or 192GB of RAM allows the GPU to use nearly all of it.
- Power Draw: Under 400W total.
- Verdict: Trivial. You could run five of these on one circuit.
Hardware Requirements for a 3B Parameter Model
For a 3 Billion parameter model, the requirements are modest, but “training” requires far more RAM than just “running” (inference).
- Inference (Running it): Needs only ~2GB of VRAM (with 4-bit quantisation).
- Training (Building it): To train efficiently with a long context window (e.g., reading full documents), you need a 40GB to 60GB memory buffer:
  - NVIDIA: 2 x 5090s (64GB total) is perfect.
  - AMD: 3 x 7900 XTXs (72GB total) is the sweet spot.
  - Apple: A single Mac Studio (M3 Ultra) with 128GB RAM is overkill but comfortable.
Software and Frameworks
Deep Learning Frameworks:
- PyTorch 2.5+: The standard. Supports FlashAttention-3 on NVIDIA 50-series.
- ROCm 6.2+ for AMD: Ensure you are running ROCm 6.2 or newer. This version brings native PyTorch support parity for the 7900 XTX.
- MLX for Apple: Apple’s MLX framework (similar to PyTorch but built for Apple Silicon).
Pricing: Building the Factory (2026 Estimates)
Let’s price the three specific builds designed to sit under the 2200W limit.
Build A: The AMD “72GB Capacity” Build (Best for VRAM/$ and Power)
- Motherboard: ASUS Pro WS TRX50-SAGE WIFI (Threadripper sTR5)
- CPU: AMD Ryzen Threadripper 7960X (24-Core)
- Memory: 256GB DDR5 ECC RDIMM
- GPU: 3 x ASUS TUF Gaming Radeon RX 7900 XTX (24GB each)
- Storage: 2 x Samsung 990 PRO 4TB NVMe M.2
- Power: 1 x 1650W PSU (sufficient for 3 cards + CPU)
- Case & Cooling: Full Tower & 360mm AIO
- Total Estimated Cost: ~$9,200 USD / ~AU$15,500 (inc GST)
Build B: The NVIDIA “Speed” Build (Best for Raw Training Speed)
- Motherboard/CPU/RAM/Storage: Same as above.
- GPU: 2 x NVIDIA GeForce RTX 5090 (32GB each)
- Power: 1 x 1650W PSU
- Total Estimated Cost: ~$11,000 USD / ~AU$18,500 (inc GST)
Build C: The Apple “Turn-key” Lab (Easiest Setup)
- Model: Mac Studio (2025/26 model)
- Chip: Apple M3 Ultra with 60-Core GPU
- Memory: 128GB Unified Memory
- Storage: 4TB SSD
- Total Estimated Cost: ~$6,500 USD / ~AU$11,000 (inc GST)
Final Verdict
For a 3 Billion Parameter Model: The Mac Studio M3 Ultra is the clear winner for a home/office environment. It provides 128GB of memory (double the NVIDIA build) for significantly less money. It draws less power than a gaming PC and requires no assembly.
For Future Scaling: If you plan to train larger models later, the AMD Build (3x 7900 XTX) wins on flexibility. You get 72GB of dedicated VRAM, and because it runs Linux/ROCm, it behaves more like a traditional server than the Mac.
So, not that expensive anymore, is it? For the price of a used sedan (AU$11,000 – $18,000), you can own a machine capable of training proprietary AI models that would have required a million-dollar cluster just five years ago.
Model Pre-Training
i. Initialize Parameters
Stable Initialization: We begin by initializing the model’s weights. While “Xavier/Glorot” was the old standard, modern Transformer architectures (like Llama 3) often use modified schemes to ensure stability with Brain Float 16 (BF16). We typically use a truncated normal distribution with specific scaling factors for the embedding layers to prevent “loss spikes” early in training.
ii. Choose a Pre-training Objective
Causal (Autoregressive) Language Modeling: This remains the gold standard. The model predicts the next token based only on the previous tokens.
The Causal Mask: You must ensure the model cannot “cheat” by seeing the future. In the self-attention mechanism, we apply a Causal Mask (a lower-triangular matrix) which forces the attention score for any future token to negative infinity, so it receives effectively zero probability after the softmax.
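The mask is just a triangle of zeros and negative infinities. A pure-Python sketch of how it zeroes out the future after softmax, for a toy sequence of length 4:

```python
# Building the causal mask: position i may attend to positions 0..i only.
# Future positions get -inf so that softmax assigns them zero probability.
import math

def causal_mask(seq_len: int):
    """Lower-triangular additive mask: 0.0 where allowed, -inf where not."""
    return [[0.0 if j <= i else -math.inf for j in range(seq_len)]
            for i in range(seq_len)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Apply the mask to the raw attention scores of position 1 (of 4):
scores = [1.0, 2.0, 3.0, 4.0]
masked = [s + m for s, m in zip(scores, causal_mask(4)[1])]
probs = softmax(masked)
print([round(p, 3) for p in probs])  # positions 2 and 3 get exactly 0.0
```

In a real attention kernel the same additive mask is applied to the full score matrix in one operation rather than row by row.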
iii. Optimization and Training
AdamW or Lion: We almost never use vanilla Adam anymore.
- AdamW (8-bit): The standard. It fixes weight decay handling. We often use the 8-bit Paged version (via the bitsandbytes library), which reduces the optimizer’s memory footprint by roughly 75% without losing accuracy.
- Lion (Evolved Sign Momentum): The 2026 contender for home-lab builds. It is a newer optimizer that is more memory efficient than AdamW because it tracks only momentum, not variance. This allows you to fit a larger batch size on your RTX 5090.
Learning Rate Scheduler:
- WSD (Warmup-Stable-Decay): The new standard over Cosine Decay. You “warm up” the learning rate, keep it “stable” (high) for roughly 80% of training to learn fast, and then sharply “decay” it at the very end. This allows you to stop training and continue later without ruining the model’s stability.
Gradient Accumulation:
- The Necessity: On a dual-5090 setup, your “batch size” per card might only be 4. This is too small for stable learning.
- The Fix: You run 32 tiny batches through the GPU, summing their gradients in the background, and then take one single optimizer step. This simulates an enterprise-grade batch size of 128 (4 x 32) on consumer hardware.
Mixed Precision Training (BF16):
- The Standard: We have moved from Float16 (FP16) to Brain Float 16 (BF16). BF16 has the same dynamic range as full 32-bit math, meaning it rarely crashes from “overflows” (numbers getting too big).
- Hardware Support: Both the RTX 5090 and Mac Studio (M3 Ultra) support native BF16 operations. You should always enable this in your training script (torch.bfloat16).
Data Loading:
- Streaming: You do not load 1TB of data into RAM. We use “Streaming Datasets” (like PyTorch IterableDatasets). The GPU grabs chunks of data from your NVMe SSD in real time as it trains.
iv. Comments
Software Environment: Ensure you are using PyTorch 2.5+ (or newer) to enable FlashAttention-3. This is a specific algorithm that speeds up attention calculation by 2-3x by reducing memory reads/writes.
Checkpointing: Save your model every 500 steps. If your power flickers or the Python script crashes, you do not want to lose 3 days of computing time.
E: Fine-Tuning for Specific Tasks and F: Evaluation and Testing
i. Task-Specific Data (Instruction Tuning)
Data Preparation: Let us assume 20% of your dataset comes from your internal company sources (e.g., previous customer support tickets or technical manuals).
- Correction: You do not “vectorise” data for fine-tuning (that is for RAG). You format and tokenise it.
- The Format: In 2026, data must be structured into Chat Templates (e.g., ChatML). You format your internal data into specific “turns”, so the model clearly distinguishes between the “User” asking a question and the “Assistant” providing the answer.
- Mixing: You mix this 20% internal data with 80% high-quality “Instruction Following” datasets (like OpenHermes or Infinity-Instruct from Hugging Face) to ensure the model retains its ability to speak naturally while learning your specific domain.
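Formatting a record into “turns” is plain string templating. A sketch using ChatML-style tags; the tag strings follow the common ChatML convention, but your base model’s template may differ, so always use the template the model was trained with (the company and ticket text below are invented):

```python
# Formatting one internal support ticket into ChatML-style "turns".
# Tag strings follow the common ChatML convention; verify against the
# template your base model actually expects.

def to_chatml(system: str, user: str, assistant: str) -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant}<|im_end|>\n"
    )

sample = to_chatml(
    "You are a support agent for AcmeCorp.",  # hypothetical company
    "My invoice shows a double charge. What do I do?",
    "Please reply to the invoice email and we will refund the duplicate.",
)
print(sample)
```

Every one of your 20% internal records gets pushed through a function like this before tokenisation, so the role markers become the control tokens described earlier.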
Dataset Size and Quality: Quality over Quantity: For fine-tuning, you need surprisingly little data. 1,000 to 5,000 high-quality examples of “Question leading to Correct Answer” are often enough to significantly alter the model’s behaviour. “Slop” (bad data) here is fatal; if you feed it bad grammar, it learns bad grammar instantly.
ii. Fine-Tuning Process (The PEFT Revolution)
We rarely do “Full Fine-Tuning” (updating all weights) on consumer hardware anymore. It is slow and prone to “catastrophic forgetting” (where the model learns your data but forgets how to speak English).
The Standard: LoRA and QLoRA:
- LoRA (Low-Rank Adaptation): We freeze the model’s main weights (the “brain”) and attach tiny, trainable “Adapter” layers to the attention mechanism. We only train these adapters.
- QLoRA: We compress the frozen brain to 4-bit (saving massive VRAM) and train the adapters on top. This allows you to fine-tune a ~30 billion parameter model on a single 24GB GPU, or a ~70 billion parameter model on a single 48GB card.
Learning Rate Adjustment: Unlike full pre-training, LoRA allows for aggressive learning rates. We often use 2e-4 for the adapters. (If doing full fine-tuning, you would use a tiny 1e-5).
Optimizer & Epochs:
- Optimizer: Use Paged AdamW 8-bit. This offloads optimizer states to system RAM if the GPU gets full, preventing crashes.
- Epochs: Keep it short. 1 to 3 epochs is usually the maximum. Over-training on small data leads to the model memorising the examples verbatim rather than learning the concept.
Example Fine-Tuning Loop (Concept using Unsloth/PyTorch): You do not typically modify the output head for GenAI. You strictly train the model to predict the next token in your specific “Chat Format”.
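Since a full Unsloth/PyTorch training run needs a GPU, here is the LoRA mechanism itself as a runnable NumPy sketch (dimensions match the 3072 hidden size used later; rank and alpha are the starting values recommended below):

```python
import numpy as np

d, r, alpha = 3072, 16, 32            # hidden size, LoRA rank, LoRA alpha
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))           # frozen pre-trained weight (the "brain")
A = rng.normal(size=(r, d)) * 0.01    # trainable adapter (down-projection)
B = np.zeros((d, r))                  # trainable adapter (up-projection, zero-init)

def lora_forward(x):
    # Output = frozen path + scaled low-rank adapter path
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(1, d))
# B is zero-initialised, so at step 0 the LoRA output equals the frozen output
assert np.allclose(lora_forward(x), x @ W.T)

frozen = W.size
trainable = A.size + B.size
print(f"trainable fraction: {trainable / frozen:.4%}")  # ~1% of the layer
```

Only `A` and `B` receive gradient updates; the frozen `W` never changes, which is why catastrophic forgetting is far less likely than with full fine-tuning.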
iii. Performance Metrics (The Death of BLEU)
Old metrics like BLEU or ROUGE (used for translation) are terrible for Chat Models. They check word overlap, not meaning.
Choosing Metrics:
- Perplexity: Useful as a sanity check (did I break the model?), but it doesn’t tell you if the model is smart.
- The 2026 Standard, “LLM-as-a-Judge”: You use a smarter model (like GPT-4o or a dedicated “Judge” model like Prometheus-2) to evaluate your fine-tuned model.
- Process: You generate 50 answers from your new model. You feed the Question, the Answer, and the “Gold Answer” to the judge and ask: “Grade this answer on a scale of 1-5 for accuracy and tone.”
- Benchmarks: Use IFEval (Instruction Following Evaluation) to see if your model actually follows rules (e.g., “Write a response without using the word ‘not’”).
Validation Set: Strictly separate your “Test” data. If you train on your test questions, you are cheating.
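The judge workflow boils down to prompt construction plus score extraction. This sketch stubs out the actual API call (the template wording and the `Score: N` reply format are assumptions, not a fixed standard):

```python
import re

JUDGE_TEMPLATE = """You are a strict grader.
Question: {question}
Gold Answer: {gold}
Model Answer: {answer}
Grade the Model Answer on a scale of 1-5 for accuracy and tone.
Reply in the form 'Score: N'."""

def build_judge_prompt(question, gold, answer):
    return JUDGE_TEMPLATE.format(question=question, gold=gold, answer=answer)

def parse_score(judge_reply):
    """Pull the 1-5 grade out of the judge model's free-text reply."""
    m = re.search(r"Score:\s*([1-5])", judge_reply)
    return int(m.group(1)) if m else None

prompt = build_judge_prompt("What is 2+2?", "4", "The answer is 4.")
# In practice you would send `prompt` to GPT-4o / Prometheus-2 here.
print(parse_score("Score: 5 - accurate and well-toned"))  # 5
```

Averaging the parsed scores over your 50 generated answers gives a single quality number you can track across fine-tuning runs.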
iv. Iterative Improvement
Hyperparameter Tuning: Rank (r) and Alpha: In LoRA, you tune the “Rank” (how smart the adapter is). Start with a rank of 16 and an alpha of 32. If the model is too dumb, increase the rank to 64.
Early Stopping: Watch the Validation Loss. The moment it starts going up (while Training Loss goes down), stop immediately. Your model is now memorising, not learning.
Checkpointing: Save Adapters Only. A LoRA adapter is tiny (approx 100MB). You don’t need to save the full 10GB model every time. You just save the small adapter file.
v. Additional Considerations
Hardware Utilization:
- FlashAttention-3: Ensure this is enabled. It drastically speeds up training on RTX 40/50 series cards.
- VRAM Management: If you hit Out of Memory (OOM), decrease Batch Size and increase Gradient Accumulation Steps.
Software Stack: Stick to the Unsloth library or Axolotl for fine-tuning. They handle all the VRAM optimisation, LoRA injection, and data formatting automatically, saving you from writing hundreds of lines of raw PyTorch code.
Step-by-step Technical Guide
This is the final piece of the puzzle. You have the hardware (Mac Studio or the AMD/NVIDIA workstation), but without the correct software stack and “hyper-parameters” (the settings for the brain), you just have expensive heaters.
Here is the technical manual for late 2025/early 2026, combining the installation protocols with the step-by-step execution guide for building a 3 Billion parameter model.
1. Software Installation (The “How-To”)
The installation commands differ entirely based on your hardware path.
Path A: The Apple Mac Studio (The Easy Route)
Apple’s MLX framework has matured into the standard. It replaces PyTorch for Apple Silicon, compiling directly to Metal.
The Command: Open your terminal and run:
# 1. Install the MLX Language Model library (includes dependencies)
pip install mlx-lm
# 2. (Optional) Install Hugging Face transfer to speed up downloads
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
Path B: The Linux Workstation (AMD 7900 XTX)
AMD requires ROCm 6.2+ (released late 2024/early 2025). Do not use older versions (5.7) as they lack FlashAttention support.
The Command (Ubuntu 24.04):
# 1. Install ROCm 6.2 drivers (system level)
sudo apt update && sudo apt install amdgpu-install
sudo amdgpu-install --usecase=rocm,hiplibsdk
# 2. Create a Python environment (Conda is standard)
conda create -n llm-build python=3.12
conda activate llm-build
# 3. Install PyTorch with ROCm support (The critical step)
# Note: You must index the specific ROCm wheels or PyTorch will fail to see the GPUs
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
# 4. Install essential libraries
pip install transformers datasets safetensors accelerate packaging
# Run this one from inside a cloned trainer repo (e.g. Axolotl) that defines these extras
pip install -e '.[flash-attn,deepspeed]'
Path C: The Linux Workstation (NVIDIA 5090)
NVIDIA uses CUDA 13 (standard for 50-series).
The Command:
# 1. Install Unsloth (The fastest trainer for NVIDIA)
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# 2. Install Flash Attention 3 (Specific for 50-series)
pip install flash-attn --no-build-isolation
2. The Build Workflow: From Scratch (Execution Phase)
Once the software is installed, you move to the actual build. This section outlines the process specifically for the AMD 7900 XTX Cluster, as it requires the most specific configuration tweaks to avoid crashing.
Step 1: Data Preparation
- Gather Data: Download datasets like FineWeb-Edu (general knowledge) and mix in your 20% proprietary internal data.
- Preprocess: Standardise everything to JSONL format.
- Tokenization: In 2026, we train a Byte-Level BPE tokenizer. Use the tokenizers library (Rust-based).
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
# Initialize BPE (Byte-Pair Encoding)
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
# Train with modern special tokens
trainer = trainers.BpeTrainer(
    vocab_size=128000,
    special_tokens=["<|begin_of_text|>", "<|end_of_text|>", "<|pad|>", "<|user|>", "<|assistant|>"]
)
tokenizer.train(["your_dataset.txt"], trainer)
Step 2: Model Architecture Design
Define the Model: For a 3 billion parameter target, use a Llama-style Decoder-Only architecture. Configuration:
- Layers: 28
- Hidden Dimension: 3072
- Heads: 24
- Context Window: 8192 (Note: Higher context requires more VRAM).
- Activation: SwiGLU (Standard for 2026).
- Positional Embeddings: RoPE (Rotary Positional Embeddings).
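A back-of-envelope check that this configuration really lands near 3 billion parameters. Note the SwiGLU intermediate size (8192), grouped-query attention with 8 KV heads, and tied embeddings are assumptions not listed in the config above, chosen to match Llama-3.2-3B-class designs:

```python
# Parameter count for the configuration above (assumptions noted in comments)
hidden, layers, heads, kv_heads = 3072, 28, 24, 8   # kv_heads: assumed GQA
intermediate, vocab = 8192, 128_000                 # intermediate: assumed SwiGLU width
head_dim = hidden // heads                          # 128

embed = vocab * hidden                              # tied input/output embedding
attn = (hidden * hidden                             # Q projection
        + 2 * hidden * (kv_heads * head_dim)        # K and V (grouped-query)
        + hidden * hidden)                          # output projection
mlp = 3 * hidden * intermediate                     # SwiGLU: gate, up, down
per_layer = attn + mlp
total = embed + layers * per_layer

print(f"{total / 1e9:.2f}B parameters")             # approx. 3.2B
```

If the count comes out far from target, the usual levers are the layer count and the intermediate width, since the MLP dominates the per-layer budget.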
Step 3: Infrastructure Setup (The AMD Specifics)
Distributed Backend: On AMD, multi-GPU communication runs over RCCL rather than NVIDIA’s NCCL. Note that PyTorch’s ROCm build still expects the backend string “nccl” and routes it to RCCL under the hood; there is no separate “rccl” backend name.
import torch.distributed as dist
# On ROCm builds, the 'nccl' backend string is mapped to RCCL automatically
dist.init_process_group(backend="nccl")
Flash Attention: Ensure you enable FlashAttention-2 or 3. Without this, training will be 3x slower and consume 2x more memory.
Step 4: The Training Loop
Precision: You must use BFloat16 (Brain Float 16). Standard Float16 is unstable for training large models.
# Optimizer with Fused kernels for speed
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), fused=True)
# The Training Step (torch.cuda.amp.autocast is deprecated; use torch.autocast)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(input_ids, labels=input_ids).loss
The “Warmup” Rule: You cannot start training at full speed (Learning Rate 3e-4). You must implement a “Linear Warmup” for the first 200-500 steps, or the model will diverge (output garbage) immediately.
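The warmup rule is easy to implement by hand. A minimal sketch of the schedule, assuming the common choice of cosine decay after the linear ramp (the decay shape is an assumption; the 3e-4 peak and 500-step warmup come from the text):

```python
import math

def lr_at_step(step, base_lr=3e-4, warmup_steps=500, total_steps=10_000):
    """Linear warmup to base_lr, then (a common choice) cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps            # ramp up from 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at_step(0))       # 0.0     -- never start at full speed
print(lr_at_step(500))     # 0.0003  -- warmup complete, at peak LR
print(lr_at_step(10_000))  # ~0.0    -- fully decayed at end of training
```

In a real run you would hand this function to `torch.optim.lr_scheduler.LambdaLR` (as a multiplier of the base LR) rather than setting learning rates manually.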
Step 5: Monitoring & Evaluation
- Monitoring: Use Weights & Biases (WandB). It is the 2026 standard for multi-GPU runs as it tracks system thermals and VRAM usage remotely.
- Evaluation: Do not rely on “Validation Loss” alone. Use “LLM-as-a-Judge”: every 1000 steps, generate text and have a larger model (like GPT-4o) grade the output accuracy.
Step 6: Saving the Model
Serialization: We do not use .pth or pickle anymore due to security risks. Use SafeTensors.
from safetensors.torch import save_file
save_file(model.state_dict(), “model.safetensors”)
3. The Missing Details (Crucial “Hyper-parameters”)
These are the details often left out of broad guides that determine if the project succeeds or fails.
i. How many Epochs? (The “Overcooking” Rule)
An Epoch is one full pass through your entire dataset.
- For Pre-training (From Scratch): 1 Epoch. If you have enough data (trillions of tokens), seeing it once is enough.
- For Fine-Tuning: 1 to 3 Epochs.
  - 1 Epoch: The model learns the style but stays creative.
  - 10+ Epochs: Do not do this. The model will overfit and start repeating your training data verbatim.
ii. Data Selection & “The Mix”
- The Golden Ratio: 20% Internal / 80% General.
- The Mix:
  - 20% Your Data: (Cleaned, formatted).
  - 40% Instruct Data: (OpenHermes, Infinity-Instruct) to teach it how to follow orders.
  - 40% Raw Quality: (FineWeb-Edu) to maintain its English fluency.
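The 20/40/40 mix reduces to weighted sampling. In practice you would use `datasets.interleave_datasets` with these probabilities; here is a stdlib-only sketch of the same idea (source contents are placeholders):

```python
import random
from collections import Counter

# The three sources from the mix above (contents are placeholders)
weights = {
    "internal": 0.2,   # your cleaned, formatted company data
    "instruct": 0.4,   # OpenHermes / Infinity-Instruct style
    "raw":      0.4,   # FineWeb-Edu style
}

rng = random.Random(42)
names = list(weights)

def sample_mix(n):
    """Draw n training examples following the 20/40/40 'Golden Ratio'."""
    return rng.choices(names, weights=[weights[k] for k in names], k=n)

counts = Counter(sample_mix(10_000))
print(counts)   # roughly 2000 internal, 4000 instruct, 4000 raw
```

Sampling per-example (rather than concatenating the datasets) keeps the ratio stable even when the three sources have wildly different sizes.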
iii. Context Length vs. VRAM
A “3 Billion” parameter model is small, but the Context Window (how much text it reads at once) eats VRAM.
- Training at 2048 tokens: Uses ~14GB VRAM. (Easy).
- Training at 8192 tokens: Uses ~30GB VRAM. (Needs 5090 or 7900 XTX).
- Training at 32k tokens: Uses ~80GB VRAM. (Requires the Mac Studio or multi-GPU AMD build).
iv. Checkpointing Strategy
Do not wait until the end to save. If power fails on day 6 of a 7-day run, you lose everything. Implement “Rolling Checkpoints”:
- Save every 500 steps.
- Keep only the last 3 checkpoints to save disk space.
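Rolling checkpoints are a few lines of file management. A minimal sketch (the checkpoint files here are stubs; a real run would call `safetensors.torch.save_file` at the marked line):

```python
import os
import tempfile

def save_rolling_checkpoint(step, ckpt_dir, keep_last=3):
    """Write a checkpoint for `step`, then prune all but the newest `keep_last`."""
    path = os.path.join(ckpt_dir, f"checkpoint-{step:07d}.safetensors")
    with open(path, "wb") as f:
        f.write(b"...")                      # real code: save_file(state_dict, path)
    # Zero-padded step numbers make lexicographic sort == chronological sort
    ckpts = sorted(p for p in os.listdir(ckpt_dir) if p.startswith("checkpoint-"))
    for old in ckpts[:-keep_last]:           # delete everything but the newest 3
        os.remove(os.path.join(ckpt_dir, old))

ckpt_dir = tempfile.mkdtemp()
for step in range(500, 3001, 500):           # simulate saving every 500 steps
    save_rolling_checkpoint(step, ckpt_dir)
print(sorted(os.listdir(ckpt_dir)))          # only steps 2000, 2500, 3000 remain
```

Call this inside the training loop whenever `step % 500 == 0`; disk usage stays bounded no matter how long the run lasts.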
4. Reference: The Four Pillars of AI Software
It is crucial to distinguish between these four distinct software pipelines.
| Term | Analogy | Definition | Hardware Requirement | Software Tool |
| --- | --- | --- | --- | --- |
| Pre-Training | “Birth” | Creating a model from scratch. Teaching it English/Logic. Takes weeks. | Massive (4x GPUs / Mac Studio) | PyTorch (Native), Nanotron |
| Fine-Tuning | “School” | Teaching an existing model a specific job. Takes hours. | Moderate (1x GPU / Mac) | Unsloth (NVIDIA), Axolotl (AMD), MLX (Apple) |
| RAG | “Library” | (Retrieval Augmented Generation). The model searches a database before answering. It does not learn; it just reads. | Low (CPU or weak GPU) | LangChain, LlamaIndex, ChromaDB |
| Inference | “Speech” | Running the model to get answers. | Low (Laptop / Edge) | vLLM (Server), Llama.cpp (Local) |
___________________________________________
Final Checklist
- Safety: Ensure your electrician verifies the 15A circuit if running the 4-GPU Workstation build.
- Storage: Ensure you have NVMe Gen 4 drives. Mechanical hard drives will cause the GPU to sit idle (0% utilisation) while waiting for data.
- Backup: Hugging Face Hub is free. Create a private repository and upload your “Adapters” (the saved training files) there automatically.
About The Author

Bogdan Iancu
Bogdan Iancu is a seasoned entrepreneur and strategic leader with over 25 years of experience in diverse industrial and commercial fields. His passion for AI, Machine Learning, and Generative AI is underpinned by a deep understanding of advanced calculus, enabling him to leverage these technologies to drive innovation and growth. As a Non-Executive Director, Bogdan brings a wealth of experience and a unique perspective to the boardroom, contributing to robust strategic decisions. With a proven track record of assisting clients worldwide, Bogdan is committed to harnessing the power of AI to transform businesses and create sustainable growth in the digital age.