Peft Fine Tuning

Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when fine-tuning large models (7B-70B) with limited GPU memory, when you need to train <1% of parameters with minimal accuracy loss, or for multi-adapter serving. HuggingFace's official library integrated with transformers ecosystem.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
1 · 1.7k · 5 current installs · 5 all-time installs
MIT-0
Security Scan
VirusTotalVirusTotal
Benign
View report →
OpenClawOpenClaw
Benign
high confidence
Purpose & Capability
Name/description match the content: SKILL.md and references contain recipes for LoRA/QLoRA, adapter management, memory optimizations, and related tooling (transformers, peft, bitsandbytes). Required artifacts (model downloads, pip-installed packages) align with fine-tuning LLMs.
Instruction Scope
Instructions are focused on fine-tuning and troubleshooting. They instruct network activity (pip install, huggingface model downloads, cloning bitsandbytes), building bitsandbytes from source, and running conversion scripts — all expected for the domain but worth noting because they cause code download and local compilation which execute on the host.
Install Mechanism
No install spec in the skill bundle (instruction-only). The runtime guidance uses pip and GitHub, which is normal for Python ML workflows; nothing in the bundle performs arbitrary downloads itself.
Credentials
The skill declares no required environment variables, credentials, or config paths. The instructions may implicitly need access to network, disk, and GPU, which are proportionate to model download/training tasks.
Persistence & Privilege
Skill is instruction-only, always:false, and does not request persistent privileges or to modify other skills or agent configurations.
Assessment
This is a how-to guide for PEFT-style fine-tuning; it appears internally consistent. Before you run it: (1) be prepared for large network downloads (LLM weights can be many GB) and local disk use; (2) pip installing packages and building bitsandbytes from source will run code on your machine—only run these commands on trusted systems and inspect commands if you have concerns; (3) some models (e.g., meta-llama variants) may require explicit licensing or HuggingFace authentication—this skill does not request credentials but you may need them to access certain models; (4) follow GPU/CUDA compatibility notes in troubleshooting to avoid runtime failures. If you want to be extra cautious, run package installs in an isolated virtual environment or container and verify any third-party scripts (e.g., convert-hf-to-gguf.py or cloned build scripts) before executing them.

Like a lobster shell, security has layers — review code before you run it.

Current versionv0.1.0
Download zip
latestvk9787xem1w36zcv8nz7b0qkh2x7zyd2h

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

SKILL.md

PEFT (Parameter-Efficient Fine-Tuning)

Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.

When to use PEFT

Use PEFT/LoRA when:

  • Fine-tuning 7B-70B models on consumer GPUs (RTX 4090, A100)
  • Need to train <1% parameters (6MB adapters vs 14GB full model)
  • Want fast iteration with multiple task-specific adapters
  • Deploying multiple fine-tuned variants from one base model

Use QLoRA (PEFT + quantization) when:

  • Fine-tuning 70B models on single 24GB GPU
  • Memory is the primary constraint
  • Can accept ~5% quality trade-off vs full fine-tuning

Use full fine-tuning instead when:

  • Training small models (<1B parameters)
  • Need maximum quality and have compute budget
  • Significant domain shift requires updating all weights

Quick start

Installation

# Basic installation
pip install peft

# With quantization support (recommended)
pip install peft bitsandbytes

# Full stack
pip install peft transformers accelerate bitsandbytes datasets

LoRA fine-tuning (standard)

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset

# Load base model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank (8-64, higher = more capacity)
    lora_alpha=32,                 # Scaling factor (typically 2*r)
    lora_dropout=0.05,             # Dropout for regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
    bias="none"                    # Don't train biases
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 8,043,307,008 || trainable%: 0.17%

# Prepare dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

# Training
training_args = TrainingArguments(
    output_dir="./lora-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=lambda data: {"input_ids": torch.stack([f["input_ids"] for f in data]),
                                 "attention_mask": torch.stack([f["attention_mask"] for f in data]),
                                 "labels": torch.stack([f["input_ids"] for f in data])}
)

trainer.train()

# Save adapter only (6MB vs 16GB)
model.save_pretrained("./lora-llama-adapter")

QLoRA fine-tuning (memory-efficient)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype="bfloat16",   # Compute in bf16
    bnb_4bit_use_double_quant=True       # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training (enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# LoRA config for QLoRA
lora_config = LoraConfig(
    r=64,                              # Higher rank for 70B
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
# 70B model now fits on single 24GB GPU!

LoRA parameter selection

Rank (r) - capacity vs efficiency

RankTrainable ParamsMemoryQualityUse Case
4~3MMinimalLowerSimple tasks, prototyping
8~7MLowGoodRecommended starting point
16~14MMediumBetterGeneral fine-tuning
32~27MHigherHighComplex tasks
64~54MHighHighestDomain adaptation, 70B models

Alpha (lora_alpha) - scaling factor

# Rule of thumb: alpha = 2 * rank
LoraConfig(r=16, lora_alpha=32)  # Standard
LoraConfig(r=16, lora_alpha=16)  # Conservative (lower learning rate effect)
LoraConfig(r=16, lora_alpha=64)  # Aggressive (higher learning rate effect)

Target modules by architecture

# Llama / Mistral / Qwen
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# GPT-2 / GPT-Neo
target_modules = ["c_attn", "c_proj", "c_fc"]

# Falcon
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# BLOOM
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# Auto-detect all linear layers
target_modules = "all-linear"  # PEFT 0.6.0+

Loading and merging adapters

Load trained adapter

from peft import PeftModel, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM

# Option 1: Load with PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")

# Option 2: Load directly (recommended)
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-llama-adapter",
    device_map="auto"
)

Merge adapter into base model

# Merge for deployment (no adapter overhead)
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./llama-merged")
tokenizer.save_pretrained("./llama-merged")

# Push to Hub
merged_model.push_to_hub("username/llama-finetuned")

Multi-adapter serving

from peft import PeftModel

# Load base with first adapter
model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1")

# Load additional adapters
model.load_adapter("./adapter-task2", adapter_name="task2")
model.load_adapter("./adapter-task3", adapter_name="task3")

# Switch between adapters at runtime
model.set_adapter("task1")  # Use task1 adapter
output1 = model.generate(**inputs)

model.set_adapter("task2")  # Switch to task2
output2 = model.generate(**inputs)

# Disable adapters (use base model)
with model.disable_adapter():
    base_output = model.generate(**inputs)

PEFT methods comparison

MethodTrainable %MemorySpeedBest For
LoRA0.1-1%LowFastGeneral fine-tuning
QLoRA0.1-1%Very LowMediumMemory-constrained
AdaLoRA0.1-1%LowMediumAutomatic rank selection
IA30.01%MinimalFastestFew-shot adaptation
Prefix Tuning0.1%LowMediumGeneration control
Prompt Tuning0.001%MinimalFastSimple task adaptation
P-Tuning v20.1%LowMediumNLU tasks

IA3 (minimal parameters)

from peft import IA3Config

ia3_config = IA3Config(
    target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],
    feedforward_modules=["down_proj"]
)
model = get_peft_model(model, ia3_config)
# Trains only 0.01% of parameters!

Prefix Tuning

from peft import PrefixTuningConfig

prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,      # Prepended tokens
    prefix_projection=True       # Use MLP projection
)
model = get_peft_model(model, prefix_config)

Integration patterns

With TRL (SFTTrainer)

from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./output", max_seq_length=512),
    train_dataset=dataset,
    peft_config=lora_config,  # Pass LoRA config directly
)
trainer.train()

With Axolotl (YAML config)

# axolotl config.yaml
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_target_linear: true  # Target all linear layers

With vLLM (inference)

from vllm import LLM
from vllm.lora.request import LoRARequest

# Load base model with LoRA support
llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)

# Serve with adapter
outputs = llm.generate(
    prompts,
    lora_request=LoRARequest("adapter1", 1, "./lora-adapter")
)

Performance benchmarks

Memory usage (Llama 3.1 8B)

MethodGPU MemoryTrainable Params
Full fine-tuning60+ GB8B (100%)
LoRA r=1618 GB14M (0.17%)
QLoRA r=166 GB14M (0.17%)
IA316 GB800K (0.01%)

Training speed (A100 80GB)

MethodTokens/secvs Full FT
Full FT2,5001x
LoRA3,2001.3x
QLoRA2,1000.84x

Quality (MMLU benchmark)

ModelFull FTLoRAQLoRA
Llama 2-7B45.344.844.1
Llama 2-13B54.854.253.5

Common issues

CUDA OOM during training

# Solution 1: Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Solution 2: Reduce batch size + increase accumulation
TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16
)

# Solution 3: Use QLoRA
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

Adapter not applying

# Verify adapter is active
print(model.active_adapters)  # Should show adapter name

# Check trainable parameters
model.print_trainable_parameters()

# Ensure model in training mode
model.train()

Quality degradation

# Increase rank
LoraConfig(r=32, lora_alpha=64)

# Target more modules
target_modules = "all-linear"

# Use more training data and epochs
TrainingArguments(num_train_epochs=5)

# Lower learning rate
TrainingArguments(learning_rate=1e-4)

Best practices

  1. Start with r=8-16, increase if quality insufficient
  2. Use alpha = 2 * rank as starting point
  3. Target attention + MLP layers for best quality/efficiency
  4. Enable gradient checkpointing for memory savings
  5. Save adapters frequently (small files, easy rollback)
  6. Evaluate on held-out data before merging
  7. Use QLoRA for 70B+ models on consumer hardware

References

Resources

Files

3 total
Select a file
Select a file to preview.

Comments

Loading comments…