Install
openclaw skills install deepspeed-finetune
Fine-tune large language models using DeepSpeed on local or remote GPUs.
This skill enables efficient model fine-tuning using DeepSpeed with various optimization strategies.
pip install deepspeed
sudo apt-get install sshpass  # for remote training
Never auto-select a plan. List viable options based on user hardware and requirements, and let the user decide.
Confirm the following with the user:
If the user only provides an SSH or remote machine address, connect first and auto-detect hardware (nvidia-smi, free -h, df -h, nproc).
Estimate VRAM requirements based on model size (bf16):
| Params | Model Weights (bf16) | + Adam Optimizer + Gradients |
|---|---|---|
| 0.5B | ~1 GB | ~5 GB |
| 1.5B | ~3 GB | ~15 GB |
| 3B | ~6 GB | ~30 GB |
| 7B | ~14 GB | ~70 GB |
| 14B | ~28 GB | ~140 GB |
| 32B | ~64 GB | ~320 GB |
| 72B | ~144 GB | ~720 GB |
Breakdown: Adam optimizer stores 2 fp32 state tensors (momentum + variance) = 8 bytes/param. Gradients = 2 bytes/param (bf16). Total approx. 10 bytes/param (5x model weight size).
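The per-parameter arithmetic above can be sketched as a small helper. This is a minimal illustration that only covers weights, Adam states, and gradients; the function name `estimate_training_memory_gb` is hypothetical and not part of the skill's scripts.

```python
def estimate_training_memory_gb(params_billion: float) -> dict:
    """Rough static VRAM estimate for full fine-tuning with Adam in bf16.

    Uses the byte counts from the breakdown above; activations are excluded.
    """
    weights_gb = params_billion * 2      # bf16 weights: 2 bytes/param
    adam_states_gb = params_billion * 8  # momentum + variance in fp32: 8 bytes/param
    gradients_gb = params_billion * 2    # bf16 gradients: 2 bytes/param
    return {
        "weights_gb": weights_gb,
        "optimizer_and_gradients_gb": adam_states_gb + gradients_gb,
        "total_gb": weights_gb + adam_states_gb + gradients_gb,
    }


if __name__ == "__main__":
    for size in (0.5, 7, 72):
        print(f"{size}B params:", estimate_training_memory_gb(size))
```

For a 7B model this reproduces the table row: 14 GB of weights plus 70 GB of optimizer states and gradients.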
Activation memory: Depends on sequence length and batch size, not model params alone.
activation approx. 34 x seq_len x hidden_size x batch_size x bytes_per_element
LoRA/QLoRA: VRAM depends on rank, target modules, and layer dimensions, and is not directly proportional to total model params. See references/lora_guide.md for LoRA-specific memory estimation.
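Returning to the activation rule of thumb above, it can be evaluated directly for a given shape. The sketch below applies the formula exactly as written; the text does not say whether the figure is per layer or for the whole model, so treat the result as an order-of-magnitude guide. The function name is hypothetical.

```python
def estimate_activation_gb(seq_len: int, hidden_size: int, batch_size: int,
                           bytes_per_element: int = 2) -> float:
    """Apply: 34 x seq_len x hidden_size x batch_size x bytes_per_element."""
    total_bytes = 34 * seq_len * hidden_size * batch_size * bytes_per_element
    return total_bytes / 1e9  # decimal gigabytes


if __name__ == "__main__":
    # Example shape (assumed): 2048-token sequences, hidden size 4096, batch of 4, bf16.
    print(round(estimate_activation_gb(2048, 4096, 4), 2), "GB")
```

Because every factor is multiplicative, halving either the batch size or the sequence length halves the estimate.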
If VRAM is tight, activation checkpointing is the most impactful knob — it can reduce activation memory by ~80%.
How it works: Instead of storing all intermediate activations for backprop, only save checkpoints at select layers. Remaining activations are recomputed during backward pass. Trades compute for memory.
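The recompute-instead-of-store idea can be shown with a toy chain of scalar layers. This is a pure-Python illustration of the technique, not how DeepSpeed or PyTorch implement it; all names are hypothetical. It checks that the checkpointed backward pass yields the same gradient while holding fewer saved values at once.

```python
import math


def layer(x: float, w: float) -> float:
    return math.tanh(w * x)


def layer_grad(x: float, w: float) -> float:
    """d(layer)/dx evaluated at input x."""
    y = math.tanh(w * x)
    return w * (1.0 - y * y)


def grad_full(x0: float, weights: list) -> tuple:
    """Standard backprop: store the input of every layer."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer(x, w)
    g = 1.0
    for x_in, w in zip(reversed(inputs), reversed(weights)):
        g *= layer_grad(x_in, w)
    return g, len(inputs)


def grad_checkpointed(x0: float, weights: list, segment: int) -> tuple:
    """Store only each segment's input; recompute the rest during backward."""
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(x, w)
    g, peak = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        seg_weights = weights[start:start + segment]
        seg_inputs, x = [], checkpoints[start]
        for w in seg_weights:  # recompute this segment's activations
            seg_inputs.append(x)
            x = layer(x, w)
        # Upper bound on values held at once: all checkpoints plus one segment.
        peak = max(peak, len(checkpoints) + len(seg_inputs))
        for x_in, w in zip(reversed(seg_inputs), reversed(seg_weights)):
            g *= layer_grad(x_in, w)
    return g, peak


if __name__ == "__main__":
    ws = [0.5 + 0.1 * i for i in range(16)]
    print(grad_full(0.3, ws))
    print(grad_checkpointed(0.3, ws, segment=4))
```

With 16 layers and segments of 4, the checkpointed version holds at most 8 values instead of 16, at the cost of one extra forward pass through each segment.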
Two ways to enable:
Via the training script flag:
python scripts/ds_train.py --gradient_checkpointing ...
Via the DeepSpeed config file:
{
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": true,
    "number_checkpoints": 4
  }
}
| Option | Effect | When to use |
|---|---|---|
| partition_activations | Shard checkpoints across model-parallel GPUs | Multi-GPU with model parallelism |
| cpu_checkpointing | Store checkpoints in CPU RAM instead of GPU | GPU memory very tight |
| contiguous_memory_optimization | Reduce memory fragmentation | Large models, many checkpoints |
| number_checkpoints | Control checkpoint frequency (fewer = less VRAM, more compute) | Tune based on VRAM budget |
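As one way to apply the "When to use" column, the config block could be generated from a few hardware facts. This is a sketch under assumptions: the helper name and its boolean inputs are hypothetical, and the mapping simply mirrors the table, so check the result against the DeepSpeed configuration documentation for your version.

```python
import json


def build_activation_checkpointing(model_parallel: bool,
                                   gpu_memory_tight: bool,
                                   large_model: bool,
                                   number_checkpoints: int = 4) -> dict:
    """Map hardware facts onto the options in the table above."""
    return {
        "activation_checkpointing": {
            # Shard checkpoints only when model parallelism is in use.
            "partition_activations": model_parallel,
            # Move checkpoints to CPU RAM when GPU memory is very tight.
            "cpu_checkpointing": gpu_memory_tight,
            # Reduce fragmentation for large models with many checkpoints.
            "contiguous_memory_optimization": large_model,
            "number_checkpoints": number_checkpoints,
        }
    }


if __name__ == "__main__":
    block = build_activation_checkpointing(
        model_parallel=False, gpu_memory_tight=True, large_model=True)
    print(json.dumps(block, indent=2))
```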
Based on the VRAM assessment, list all viable approaches. Example:
Based on your hardware (single 24GB GPU, 64GB RAM, 500GB disk),
Qwen2.5-7B has these training options:
Option A: LoRA Fine-tuning (Recommended)
- VRAM needed: ~22 GB
- Speed: Fast
- Quality: Good for instruction alignment, style adaptation
- Trainable params: ~20M (0.4% of total)
Option B: QLoRA Fine-tuning (Saves VRAM)
- VRAM needed: ~12 GB
- Speed: Medium (quantization/dequantization overhead)
- Quality: Slightly below LoRA, but gap is small
Option C: Full Fine-tuning (Not feasible)
- VRAM needed: ~84 GB before activations (~14 GB weights + ~70 GB optimizer states and gradients; exceeds 24GB)
- Requires ZeRO-2 + CPU offload, or larger GPU
Which option do you prefer?
If no plan is viable on current hardware, recommend specs using generic hardware metrics (no brand names):
You want to fully fine-tune a 7B model, but current hardware (single 24GB GPU) is insufficient.
Recommended hardware specs:
Minimum:
- GPU: single 80GB VRAM
- CPU: 16+ cores
- RAM: 128 GB+
- Disk: 200 GB+ free space
Recommended:
- GPU: 2x 80GB VRAM (ZeRO-2 shards optimizer states and gradients across both GPUs, and data parallelism can approach 2x throughput)
- CPU: 32+ cores
- RAM: 256 GB+
- Disk: 500 GB+ free space
Alternatively, use LoRA — 24GB VRAM is sufficient for 7B models.
Generate DeepSpeed ZeRO configurations:
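from scripts.generate_ds_config import generate_zero_config

# ZeRO Stage 2 with optimizer offloading to CPU RAM.
# DeepSpeed supports NVMe offload only with ZeRO Stage 3; to offload to NVMe,
# use zero_stage=3 with offload_device="nvme" and nvme_path="/local_nvme".
config = generate_zero_config(
    zero_stage=2,
    offload_optimizer=True,
    offload_device="cpu",
)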
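For orientation, a generator like this presumably emits a dictionary along the following lines. The sketch is not the skill's `generate_zero_config`; `build_zero_config` is a hypothetical stand-in that uses DeepSpeed's documented config keys (`zero_optimization.stage`, `offload_optimizer.device`, `nvme_path`, `pin_memory`, `bf16.enabled`) and rejects NVMe offload outside ZeRO stage 3, which is where DeepSpeed documents it as available.

```python
import json


def build_zero_config(zero_stage: int, offload_optimizer: bool = False,
                      offload_device: str = "cpu", nvme_path: str = None) -> dict:
    """Hypothetical sketch of a ZeRO config builder (not the skill's own code)."""
    if offload_optimizer and offload_device == "nvme":
        if zero_stage != 3:
            raise ValueError("NVMe offload requires ZeRO stage 3")
        if not nvme_path:
            raise ValueError("nvme_path is required for NVMe offload")
    zero = {"stage": zero_stage}
    if offload_optimizer:
        offload = {"device": offload_device, "pin_memory": True}
        if offload_device == "nvme":
            offload["nvme_path"] = nvme_path
        zero["offload_optimizer"] = offload
    return {"bf16": {"enabled": True}, "zero_optimization": zero}


if __name__ == "__main__":
    print(json.dumps(build_zero_config(2, offload_optimizer=True), indent=2))
    print(json.dumps(build_zero_config(3, True, "nvme", "/local_nvme"), indent=2))
```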
Use the training launcher script:
python scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --dataset_path data/my_dataset \
  --output_dir ./outputs \
  --deepspeed assets/ds_config_zero2.json \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4 \
  --learning_rate 2e-5 \
  --lora_r 16 \
  --lora_alpha 32
For parameter-efficient fine-tuning:
# LoRA config is auto-generated based on arguments
peft_config = {
    "peft_type": "LORA",
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"],
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM"
}
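To see how rank and target modules drive the trainable parameter count, the adapter sizes can be summed by hand. Each LoRA adapter on a linear layer of shape (d_in, d_out) adds r x (d_in + d_out) parameters. The sketch below assumes Llama-2-7B's attention shapes (hidden size 4096, 32 layers, all four projections 4096 x 4096); the helper name is hypothetical.

```python
def lora_trainable_params(rank: int, module_shapes: dict, num_layers: int) -> int:
    """Sum r * (d_in + d_out) over the targeted modules, across all layers."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes.values())
    return per_layer * num_layers


# Assumed shapes for Llama-2-7B attention projections.
LLAMA2_7B_ATTENTION = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
}

if __name__ == "__main__":
    count = lora_trainable_params(16, LLAMA2_7B_ATTENTION, num_layers=32)
    print(f"{count:,} trainable parameters")
```

With r=16 this comes to roughly 16.8M parameters, the same order of magnitude as the ~20M quoted in the LoRA option earlier. Models with grouped-query attention have smaller k_proj and v_proj shapes, so their counts differ.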
Use the deepspeed launcher for multi-GPU training (recommended over torchrun):
# Multi-GPU on single node
deepspeed --num_gpus=4 scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --deepspeed assets/ds_config_zero3.json \
  ...

# Multi-node
deepspeed --hostfile hosts.txt scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --deepspeed assets/ds_config_zero3.json \
  ...
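The multi-node command reads a hostfile in which each line names a host and its GPU count as `hostname slots=N`. A minimal sketch for generating one follows; the host names `worker-1` and `worker-2` are placeholders, and the launcher is assumed to reach each host over passwordless SSH.

```python
def render_hostfile(nodes: dict) -> str:
    """Render DeepSpeed hostfile lines: '<hostname> slots=<num_gpus>'."""
    return "".join(f"{host} slots={gpus}\n" for host, gpus in nodes.items())


if __name__ == "__main__":
    # Placeholder hosts; replace with machines reachable over SSH.
    content = render_hostfile({"worker-1": 4, "worker-2": 4})
    with open("hosts.txt", "w") as fh:
        fh.write(content)
    print(content, end="")
```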
Monitor training progress:
from scripts.monitor_training import TrainingMonitor
monitor = TrainingMonitor(log_dir="./outputs")
monitor.plot_loss()
monitor.get_latest_checkpoint()
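If the monitor helper is unavailable, the latest checkpoint can also be located by hand. The sketch below assumes checkpoints are saved as `checkpoint-<step>` directories under the output directory, as the Hugging Face Trainer does; it is not the implementation of `TrainingMonitor.get_latest_checkpoint`.

```python
import re
from pathlib import Path


def latest_checkpoint(log_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for path in Path(log_dir).iterdir():
        match = pattern.match(path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare steps numerically so checkpoint-1000 beats checkpoint-900.
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    outputs = Path("./outputs")
    print(latest_checkpoint(outputs) if outputs.is_dir() else "no outputs directory")
```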
Early stopping: the training script monitors eval loss, stops training early when there is no improvement across consecutive evaluations, and then loads the best checkpoint.
Parameters:
- --early_stopping_patience: How many consecutive evals without improvement to tolerate. Set to 0 to disable (default). Recommended: 3-10.
- --early_stopping_threshold: Minimum eval loss improvement to count as an improvement. Default 0.0 (any decrease counts).
Example:
python scripts/ds_train.py \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_path tatsu-lab/alpaca \
  --use_peft True \
  --early_stopping_patience 5 \
  --early_stopping_threshold 0.001 \
  --eval_strategy steps \
  --eval_steps 100 \
  --num_train_epochs 3 \
  ...
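The interaction between the two flags can be sketched as a small counter. This illustrates the rule as described above (count evaluations whose improvement does not exceed the threshold, stop once the count reaches the patience); it is not the script's actual implementation, and the class name is hypothetical.

```python
class EarlyStopper:
    """Track eval loss and signal when training should stop."""

    def __init__(self, patience: int, threshold: float = 0.0):
        self.patience = patience
        self.threshold = threshold
        self.best = None
        self.bad_evals = 0

    def should_stop(self, eval_loss: float) -> bool:
        if self.patience <= 0:  # 0 disables early stopping
            return False
        if self.best is None or self.best - eval_loss > self.threshold:
            # Improvement large enough: record it and reset the counter.
            self.best = eval_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopper(patience=2, threshold=0.01)
    for step, loss in enumerate([1.0, 0.9, 0.895, 0.896], start=1):
        print(step, loss, stopper.should_stop(loss))
```

In this run the third and fourth evaluations improve by less than 0.01, so the stopper fires on the fourth.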
Auto-configuration: When early_stopping_patience > 0, the script automatically:
- Sets load_best_model_at_end=True
- Sets metric_for_best_model=eval_loss, greater_is_better=False
- Syncs save_strategy with eval_strategy (synced saving is needed to restore the best checkpoint)
Notes:
- Set eval_strategy (e.g., steps + eval_steps), otherwise early stopping won't work
- Avoid setting patience too low (<3): early training fluctuations may cause premature stopping
- patience=5 with eval_steps=100 typically works well
When training needs to run on a remote GPU server, see references/remote_training.md for the complete guide, including agent guidelines, security model, and command reference.
- --gradient_checkpointing
- --lora_r 8
- nvidia-smi
- --dataloader_num_workers
- --save_strategy steps with --save_steps
- --save_total_limit to cap checkpoint count
- --zero3_save_16bit_model to save FP16 weights
- python3 directly instead of deepspeed launcher