How to Fine-Tune Local LLMs

Updated May 2026
Fine-tuning adapts a pre-trained language model to your specific domain or task by training it on your own data. A fine-tuned 7B model can outperform a generic 70B model on specialized tasks because it learns the patterns, vocabulary, and reasoning specific to your domain. Modern techniques like QLoRA make fine-tuning practical on a single consumer GPU.

Step 1: Prepare Your Dataset

Fine-tuning requires a dataset of example interactions in the format you want the model to learn. The standard format is JSONL (JSON Lines) with each line containing a conversation: a system prompt, user message, and the ideal assistant response. Quality matters more than quantity. 500-1,000 high-quality, carefully curated examples typically produce better results than 10,000 noisy examples.

For domain adaptation (teaching the model about a specific field), examples should cover the range of queries your model will handle, use the terminology and style you want the model to adopt, and demonstrate the depth and accuracy of response you expect. Include edge cases and difficult examples, not just simple ones.

For task specialization (training the model for a specific format or workflow), examples should demonstrate the exact input-output pattern consistently. If you want the model to always respond in a specific JSON schema, every example should follow that schema precisely.

Step 2: Choose a Base Model

Select the pre-trained model closest to your target task. For general domain adaptation, Llama 3.1 8B or 70B are excellent base models with broad initial knowledge. For coding tasks, start with Codestral or a code-focused Llama variant. For multilingual tasks, Qwen provides a strong multilingual foundation.

Model size involves a tradeoff. Larger base models require more GPU memory for training and produce better results, but smaller models train faster and serve cheaper. For most fine-tuning projects, 7-8B parameter models offer the best balance of trainability and capability. Fine-tune a 70B model only when the 8B fine-tuned version does not meet quality requirements.

Step 3: Select a Fine-Tuning Method

Full fine-tuning updates all model parameters. It produces the highest quality results but requires the most memory: roughly 4x the model size in GPU VRAM (a 7B model needs approximately 56GB for full fine-tuning). This is impractical on consumer hardware for models larger than a few billion parameters.

LoRA (Low-Rank Adaptation) freezes the original model weights and trains small adapter matrices that modify the model behavior. Only these adapters (typically 1-10% of total parameters) are updated during training. Memory requirements drop to roughly 1.5x the model size. A 7B model can be LoRA fine-tuned on a 24GB GPU.

QLoRA (Quantized LoRA) combines 4-bit quantization of the base model with LoRA training. The base model loads in 4-bit precision while the adapter trains in full precision. This dramatically reduces memory: a 7B model can be QLoRA fine-tuned on a GPU with just 8-12GB of VRAM. Quality is 95-98% of full LoRA fine-tuning with 4x less memory.

For most users, QLoRA is the recommended approach. It makes fine-tuning accessible on consumer GPUs while producing excellent results.

Step 4: Configure and Train

The primary tools for local fine-tuning are Hugging Face Transformers with the PEFT (Parameter-Efficient Fine-Tuning) library, and Unsloth which provides optimized training scripts that run 2-5x faster than standard implementations.

Key hyperparameters to set: learning rate (typically 1e-4 to 5e-5 for LoRA), number of epochs (1-3 epochs is usually sufficient, more risks overfitting), batch size (as large as your GPU memory allows), and LoRA rank (16-64, higher rank captures more complexity but uses more memory).

Training time depends on dataset size and hardware. A 1,000-example dataset fine-tuning a 7B model with QLoRA on an RTX 4090 typically completes in 30-60 minutes. Larger datasets or models take proportionally longer.

Step 5: Evaluate and Deploy

After training, evaluate the fine-tuned model on held-out examples that were not in the training set. Compare responses against the base model to verify improvement. Check for overfitting by testing on queries similar to but not identical to training examples.

To deploy through Ollama, convert the adapter weights to GGUF format using llama.cpp conversion tools, create an Ollama Modelfile that references the merged model, and register it with ollama create. For vLLM, load the base model with the LoRA adapter specified as a separate parameter.

Key Takeaway

QLoRA makes fine-tuning practical on a single consumer GPU. Start with 500-1,000 high-quality examples, fine-tune a 7-8B base model, and evaluate against held-out data. A well-fine-tuned small model beats a generic large model on specialized tasks.