Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

Deepspeed Finetune

v1.0.5

Fine-tune large language models using DeepSpeed on local or remote GPUs.

Security Scan
VirusTotal
Suspicious
View report →
OpenClaw
Benign
high confidence
Purpose & Capability
Name/description match the included files and functionality: scripts for training, config generators, monitoring, and a remote_train helper. Required binaries (python3, deepspeed, sshpass) are appropriate for local DeepSpeed runs and optional password-based remote SSH automation. No unrelated cloud credentials or unexpected tooling are requested.
Instruction Scope
SKILL.md instructs the agent to perform local and remote operations (auto-detect remote hardware via nvidia-smi, free, df, launch training, monitor logs) and to use subagents (sessions_spawn/sessions_yield) for remote tasks. It also shows passing REMOTE_SSH_PASSWORD via environment variables, generating/uploading SSH keys, and disabling StrictHostKeyChecking for automation. These actions are coherent for remote training but have clear security trade-offs (see user guidance). The SKILL.md references REMOTE_SSH_PASSWORD and remote file creation (.remote_train_session.json) even though no env vars were declared in the registry metadata—this is an explicit runtime usage rather than a static registry requirement.
Install Mechanism
No install spec is provided (instruction-only skill), and all code is bundled with the skill (no external downloads or extract steps). That lowers install-time risk—scripts run at runtime rather than pulling arbitrary remote binaries during install.
Credentials
The skill declares no required environment variables in the registry, which is reasonable, but the runtime instructions and examples rely on an environment variable REMOTE_SSH_PASSWORD when the user supplies a password. No unrelated secret tokens (cloud keys, API tokens) are requested. The number and type of environment interactions are proportionate to the stated remote-training purpose, but the skill relies on password passing and key generation which are sensitive operations and should be handled consciously by the user.
Persistence & Privilege
always:false and default autonomous invocation are normal. The skill will create local artifacts during remote workflows: a ControlMaster socket in a temp directory and a .remote_train_session.json file with connection metadata (claimed to be non-sensitive). It also recommends generating an ed25519 keypair with no passphrase and uploading the public key for passwordless login—this is functional but increases long-term access if the private key is stored insecurely. Nothing in the skill attempts to modify other skills or system-wide agent settings.
Assessment
This skill appears to do what it says (DeepSpeed fine-tuning, including remote training), but review and operate cautiously:

  • Review remote_train.py before use: it orchestrates SSH, installs, and key setup — verify there is no unexpected network exfiltration or telemetry. The source is listed in the SKILL.md homepage, but the registry source is 'unknown', so confirm code provenance.
  • Prefer SSH key auth over password-based automation (sshpass). The skill supports generating/uploading keys, but auto-generating a private key with no passphrase creates a persistent credential — store the private key securely and revoke it if the host is compromised.
  • Be aware StrictHostKeyChecking=no is used for automation: this disables SSH host-key validation and makes MITM attacks possible. If you must use password automation initially, add the host key to known_hosts afterwards (ssh-keyscan >> ~/.ssh/known_hosts) and switch to key-based auth.
  • Passwords passed via environment variables are common for ephemeral automation but can leak in process listings or logs on misconfigured systems. Provide REMOTE_SSH_PASSWORD only on trusted machines and avoid storing it on disk.
  • The skill will create a session file (.remote_train_session.json) and temporary ControlMaster sockets in /tmp — clean these up periodically (rm -rf /tmp/deepspeed_remote_ssh/ and remove the session file) if you are concerned about lingering access.
  • Test first on a non-sensitive or disposable remote VM to validate behavior and side effects (install steps, file writes, ports opened) before using on production hosts or with sensitive data.

If you want higher assurance, ask the publisher for an auditable release (named maintainer, commit hashes), or run the skill code in a sandbox and review remote_train.py, ds_train.py, and monitor_training.py for any unexpected network connections or data uploads beyond standard model/dataset transfer.

Like a lobster shell, security has layers — review code before you run it.

Runtime requirements

Clawdis
Bins: python3, deepspeed, sshpass
latest: vk97aw2a35zyxcr5dfbzrd94b6183qpgh
117 downloads
0 stars
6 versions
Updated 3w ago
v1.0.5
MIT-0

DeepSpeed Fine-tuning Skill

This skill enables efficient model fine-tuning using DeepSpeed with various optimization strategies.

Prerequisites

  • Python 3.8+
  • GPU(s) or accelerator(s) with DeepSpeed-supported backend (CUDA, ROCm, Intel XPU, etc.)
  • DeepSpeed: pip install deepspeed
  • Transformers, Datasets, PEFT (for LoRA support)
  • sshpass: sudo apt-get install sshpass (for remote training)

Plan Selection Workflow

Never auto-select a plan. List viable options based on user hardware and requirements, and let the user decide.

Step 1: Gather Information

Confirm the following with the user:

  • Target model: Model name and parameter count (e.g., Qwen2.5-7B)
  • Hardware environment:
    • GPU VRAM x count (e.g., "single 24GB GPU")
    • CPU core count
    • RAM size
    • Free disk space
    • NVMe SSD availability (affects ZeRO NVMe offload)
  • Training goal: Full fine-tuning or parameter-efficient? Dataset size? Expected quality?
  • Budget/time constraints: Acceptable training duration?

If the user only provides an SSH or remote machine address, connect first and auto-detect hardware (nvidia-smi, free -h, df -h, nproc).
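
The detection pass above can be sketched as a few shell one-liners (shown running locally here; for a remote machine, prefix each command with `ssh user@host`, where user@host is a placeholder for the user-supplied address):

```shell
# Auto-detect hardware for plan selection.
# nvidia-smi may be absent on non-NVIDIA machines, so guard it.
command -v nvidia-smi >/dev/null \
  && nvidia-smi --query-gpu=name,memory.total --format=csv,noheader \
  || echo "no NVIDIA GPU detected"
free -h | awk '/^Mem:/ {print "RAM total:", $2}'
df -h . | awk 'NR==2 {print "free disk:", $4}'
echo "CPU cores: $(nproc)"
```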

Step 2: Evaluate Feasibility

Estimate VRAM requirements based on model size (bf16):

| Params | Model weights (bf16) | + Adam optimizer + gradients |
|--------|----------------------|------------------------------|
| 0.5B   | ~1 GB                | ~5 GB                        |
| 1.5B   | ~3 GB                | ~15 GB                       |
| 3B     | ~6 GB                | ~30 GB                       |
| 7B     | ~14 GB               | ~70 GB                       |
| 14B    | ~28 GB               | ~140 GB                      |
| 32B    | ~64 GB               | ~320 GB                      |
| 72B    | ~144 GB              | ~720 GB                      |

Breakdown: the Adam optimizer stores two fp32 state tensors (momentum + variance) = 8 bytes/param; gradients = 2 bytes/param (bf16). Optimizer state plus gradients therefore total approx. 10 bytes/param (5x the bf16 weight size) — the table's right-hand column.
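
As a sanity check, the table rows follow directly from those per-parameter byte counts (a rough GB-per-billion-params estimate that ignores activations and framework overhead):

```python
def vram_estimate_gb(params_billion: float) -> tuple[float, float]:
    """Rough bf16 full fine-tuning footprint, per the breakdown above:
    weights = 2 bytes/param; Adam state + gradients = 10 bytes/param."""
    weights_gb = params_billion * 2     # 1e9 params * 2 B ~ 2 GB per billion
    opt_grad_gb = params_billion * 10   # 8 B Adam state + 2 B bf16 grads
    return weights_gb, opt_grad_gb

weights, opt_grad = vram_estimate_gb(7)  # 7B model: ~14 GB weights, ~70 GB state
```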

Activation memory: Depends on sequence length and batch size, not model params alone.

  • Formula: activation approx. 34 x seq_len x hidden_size x batch_size x bytes_per_element
  • Example: 7B model (hidden=4096), seq_len=2048, batch_size=4, bf16 -> ~2.1 GB per layer; ~68 GB across 32 layers (can dominate VRAM)
  • Gradient checkpointing reduces this by ~80% (recomputes instead of storing), but adds ~20% compute overhead
  • Rule of thumb: if seq_len x batch_size > 8192, activation memory likely exceeds model weights
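
Plugging the example numbers into the rule of thumb (the coefficient 34 is the rough constant from the formula above, not an exact model-specific value):

```python
def activation_gib(seq_len: int, hidden: int, batch: int,
                   bytes_per_elem: int = 2, coeff: int = 34) -> float:
    """Per-layer activation memory from the rule of thumb above (GiB)."""
    return coeff * seq_len * hidden * batch * bytes_per_elem / 2**30

per_layer = activation_gib(2048, 4096, 4)  # bf16 -> ~2.1 GiB per layer
total = per_layer * 32                     # a 7B-class model has ~32 layers
```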

LoRA/QLoRA: VRAM depends on rank, target modules, and layer dimensions — not directly proportional to total model params. See references/lora_guide.md for LoRA-specific memory estimation.

Step 2.5: Activation Checkpointing

If VRAM is tight, activation checkpointing is the most impactful knob — it can reduce activation memory by ~80%.

How it works: Instead of storing all intermediate activations for backprop, only save checkpoints at select layers. Remaining activations are recomputed during backward pass. Trades compute for memory.

Two ways to enable:

  1. HF Trainer flag (simplest, works out of the box):
python scripts/ds_train.py --gradient_checkpointing ...
  2. DeepSpeed config (fine-grained control):
{
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": true,
    "number_checkpoints": 4
  }
}
| Option | Effect | When to use |
|--------|--------|-------------|
| partition_activations | Shard checkpoints across model-parallel GPUs | Multi-GPU with model parallelism |
| cpu_checkpointing | Store checkpoints in CPU RAM instead of GPU | GPU memory very tight |
| contiguous_memory_optimization | Reduce memory fragmentation | Large models, many checkpoints |
| number_checkpoints | Control checkpoint frequency (fewer = less VRAM, more compute) | Tune based on VRAM budget |

Step 3: List Options

Based on the VRAM assessment, list all viable approaches. Example:

Based on your hardware (single 24GB GPU, 64GB RAM, 500GB disk),
Qwen2.5-7B has these training options:

Option A: LoRA Fine-tuning (Recommended)
  - VRAM needed: ~22 GB
  - Speed: Fast
  - Quality: Good for instruction alignment, style adaptation
  - Trainable params: ~20M (0.4% of total)

Option B: QLoRA Fine-tuning (Saves VRAM)
  - VRAM needed: ~12 GB
  - Speed: Medium (quantization/dequantization overhead)
  - Quality: Slightly below LoRA, but gap is small

Option C: Full Fine-tuning (Not feasible)
  - VRAM needed: ~56 GB (exceeds 24GB)
  - Requires ZeRO-2 + CPU offload, or larger GPU

Which option do you prefer?

Step 4: Hardware Insufficient? Make Recommendations

If no plan is viable on current hardware, recommend specs using generic hardware metrics (no brand names):

You want to fully fine-tune a 7B model, but current hardware (single 24GB GPU) is insufficient.
Recommended hardware specs:

Minimum:
  - GPU: single 80GB VRAM
  - CPU: 16+ cores
  - RAM: 128 GB+
  - Disk: 200 GB+ free space

Recommended:
  - GPU: 2x 80GB VRAM (ZeRO-2 doubles training speed)
  - CPU: 32+ cores
  - RAM: 256 GB+
  - Disk: 500 GB+ free space

Alternatively, use LoRA — 24GB VRAM is sufficient for 7B models.

Key Principles

  • Never auto-select and start training — always list options and wait for user confirmation
  • Recommend but don't decide — say "I recommend Option A because..." but let the user choose
  • Use generic hardware metrics — VRAM in GB, GPU count, CPU cores, RAM in GB, disk in GB. No brand names.
  • Leave VRAM headroom — recommend at least 20% buffer to avoid OOM
  • If user picks an infeasible option, warn them clearly rather than silently switching

Core Capabilities

1. Training Configuration

Generate DeepSpeed ZeRO configurations:

from scripts.generate_ds_config import generate_zero_config

# ZeRO Stage 2 with optimizer offloading
config = generate_zero_config(
    zero_stage=2,
    offload_optimizer=True,
    offload_device="nvme",
    nvme_path="/local_nvme"
)
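
For reference, the JSON such a call would emit looks roughly like the following fragment (a sketch based on DeepSpeed's documented `zero_optimization` config schema; the exact output of generate_zero_config may differ):

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    }
  },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": "auto"
}
```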

2. Training Launch

Use the training launcher script:

python scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --dataset_path data/my_dataset \
  --output_dir ./outputs \
  --deepspeed assets/ds_config_zero2.json \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4 \
  --learning_rate 2e-5 \
  --lora_r 16 \
  --lora_alpha 32

3. LoRA/QLoRA Integration

For parameter-efficient fine-tuning:

# LoRA config is auto-generated based on arguments
peft_config = {
    "peft_type": "LORA",
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"],
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM"
}

4. Multi-GPU Training

Use the deepspeed launcher for multi-GPU training (recommended over torchrun):

# Multi-GPU on single node
deepspeed --num_gpus=4 scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --deepspeed assets/ds_config_zero3.json \
  ...

# Multi-node
deepspeed --hostfile hosts.txt scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --deepspeed assets/ds_config_zero3.json \
  ...

5. Training Monitoring

Monitor training progress:

from scripts.monitor_training import TrainingMonitor

monitor = TrainingMonitor(log_dir="./outputs")
monitor.plot_loss()
monitor.get_latest_checkpoint()

6. Early Stopping

Automatically monitors eval loss and stops training early when there's no improvement across consecutive evaluations, then loads the best checkpoint.

Parameters:

  • --early_stopping_patience — How many consecutive evals without improvement to tolerate. Set to 0 to disable (default). Recommended: 3-10.
  • --early_stopping_threshold — Minimum eval loss improvement to count as an improvement. Default 0.0 (any decrease counts).

Example:

python scripts/ds_train.py \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_path tatsu-lab/alpaca \
  --use_peft True \
  --early_stopping_patience 5 \
  --early_stopping_threshold 0.001 \
  --eval_strategy steps \
  --eval_steps 100 \
  --num_train_epochs 3 \
  ...

Auto-configuration: When early_stopping_patience > 0, the script automatically:

  1. Enables load_best_model_at_end=True
  2. Sets metric_for_best_model=eval_loss, greater_is_better=False
  3. Aligns save_strategy with eval_strategy (synced saving is needed to restore best checkpoint)

Notes:

  • Must also set eval_strategy (e.g., steps + eval_steps), otherwise early stopping won't work
  • Don't set patience too low (<3) — early training fluctuations may cause premature stopping
  • For LoRA fine-tuning, patience=5 with eval_steps=100 typically works well

Remote Training

When training needs to run on a remote GPU server, see references/remote_training.md for the complete guide including agent guidelines, security model, and command reference.

Troubleshooting

OOM Errors

  • Reduce batch size or increase gradient accumulation steps
  • Enable gradient checkpointing: --gradient_checkpointing
  • Use ZeRO-3 with CPU/NVMe offloading
  • Reduce LoRA rank: --lora_r 8
  • See references/troubleshooting.md for detailed solutions
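
The first bullet trades memory for steps without changing optimization: the effective batch the optimizer sees is per_device_batch x gradient_accumulation x num_gpus, so halving one and doubling the other is loss-neutral:

```python
def effective_batch(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Effective global batch size seen by the optimizer."""
    return per_device * grad_accum * num_gpus

# Halving the per-device batch while doubling accumulation keeps it constant:
assert effective_batch(4, 1, 1) == effective_batch(2, 2, 1) == 4
```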

Slow Training

  • Ensure bf16/fp16 is enabled
  • Check GPU utilization with nvidia-smi
  • Use FlashAttention if available
  • Optimize data loading with --dataloader_num_workers
  • See references/troubleshooting.md for detailed solutions

Checkpoint Issues

  • Use --save_strategy steps with --save_steps
  • Enable --save_total_limit to cap checkpoint count
  • For ZeRO-3, use --zero3_save_16bit_model to save FP16 weights
  • See references/troubleshooting.md for detailed solutions

MPI Errors (multi-GPU only)

  • Single-GPU training does not need MPI
  • If you see MPI errors on single GPU, use python3 directly instead of deepspeed launcher
  • See references/troubleshooting.md for full MPI debugging guide

Single-GPU Strategy

References
