Install
openclaw skills install autodl-train

Use this skill to operate remote model training jobs on an AutoDL Linux server over SSH. It is designed for high-frequency workflows around "start training, watch progress, inspect resources, read logs, diagnose failures, and decide what to do next" while keeping execution constrained to one configured project directory.
Log parsing recognizes common training metrics such as epoch, step, loss, lr, grad_norm, val_loss, accuracy, mAP, and F1.

Collect or confirm these values before running any script:

- host: AutoDL server hostname or IP.
- port: SSH port, usually 22.
- username: remote Linux username.
- project_path: absolute project directory on the remote server, for example /root/autodl-tmp/your-project.
- env_name, env_activate, or venv_path: how the remote Python environment is activated.
- train_command: the training launch command, such as python train.py, python -m torch.distributed.run ..., or bash scripts/train.sh.
- AUTOCLAW_TRAIN_SSH_PASSWORD: provide as an environment variable or in a local .env file when SSH key login is not available.

Prefer a config file: copy config.example.json to a real file such as config.json, or use environment variables based on .env.example.
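A minimal sketch of loading and validating such a config, assuming the field names above map directly to JSON keys (the authoritative schema is config.example.json):

```python
import json
import os

# Field names mirror the list above; treat them as assumptions until
# checked against config.example.json.
REQUIRED = ["host", "port", "username", "project_path", "train_command"]

def load_config(path: str = "config.json") -> dict:
    with open(path) as f:
        cfg = json.load(f)
    missing = [k for k in REQUIRED if k not in cfg]
    if missing:
        raise ValueError(f"config {path} is missing fields: {missing}")
    # The SSH password may come from the environment instead of the file.
    cfg.setdefault("password", os.environ.get("AUTOCLAW_TRAIN_SSH_PASSWORD"))
    return cfg
```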
Keep all remote execution inside the configured project_path. Never run destructive commands such as rm -rf, reboot, shutdown, mkfs, or fork bombs. Read config.example.json and references/usage.md to understand the expected fields, and ask the user for any missing values instead of guessing.
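scripts/common.py is described below as providing safe path checks. As an illustration only (the real guard may differ), here is a sketch that confines any remote path to project_path:

```python
import posixpath

def ensure_inside_project(path: str, project_path: str) -> str:
    """Resolve path against project_path and refuse anything that escapes it."""
    root = posixpath.normpath(project_path)
    # join() keeps absolute paths as-is; normpath collapses '..' segments.
    resolved = posixpath.normpath(posixpath.join(root, path))
    if resolved != root and not resolved.startswith(root + "/"):
        raise ValueError(f"{path!r} escapes the project directory {root!r}")
    return resolved
```

For example, logs/train.log resolves inside the project, while ../../etc/passwd raises.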
Run scripts/remote_train.py to start a background job or build a resume command:
python scripts/remote_train.py --config config.json
python scripts/remote_train.py --config config.json --resume-from outputs/checkpoints/last.ckpt
Use this when the user asks to launch training, re-launch after interruption, or resume from a checkpoint.
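For intuition, a hypothetical sketch of how a background launch with resume templating can be built. The {resume} placeholder, the --resume fallback, and the log path are assumptions for illustration, not documented behavior of remote_train.py:

```python
def build_launch_command(cfg: dict, resume_from: str | None = None) -> str:
    train = cfg["train_command"]
    if resume_from:
        # Assumed convention: fill a {resume} placeholder if the configured
        # command declares one, otherwise append a --resume flag.
        if "{resume}" in train:
            train = train.format(resume=resume_from)
        else:
            train = f"{train} --resume {resume_from}"
    # nohup + redirection keeps the job alive after the SSH session closes;
    # echo $! reports the PID for later status checks.
    return (
        f"cd {cfg['project_path']} && "
        f"nohup {train} > logs/train.log 2>&1 & echo $!"
    )
```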
Run scripts/check_status.py when the user asks whether training is still running:
python scripts/check_status.py --config config.json
This script combines process matching, nvidia-smi, and recent log updates to classify the run as running, stopped, failed, or unknown.
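As an illustration of how those three signals can combine (the thresholds and exact rules here are invented, not taken from check_status.py):

```python
def classify_run(process_alive: bool, gpu_busy: bool, log_age_s: float) -> str:
    """Fold process, GPU, and log-freshness signals into one status."""
    FRESH = 300  # seconds; assumed threshold for "log still advancing"
    if process_alive and (log_age_s < FRESH or gpu_busy):
        return "running"
    if not process_alive and log_age_s < FRESH:
        return "failed"    # log stopped moments ago: likely a crash
    if not process_alive:
        return "stopped"
    return "unknown"       # process alive but GPU idle and log stale
```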
Run scripts/monitor_resources.py to summarize GPU/CPU/memory/disk usage:
python scripts/monitor_resources.py --config config.json
Use the human-readable bottleneck assessment in the output instead of pasting raw command output unless the user asks for raw data.
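A sketch of the kind of heuristic such an assessment implies; the thresholds and wording are assumptions for illustration:

```python
def bottleneck_hint(gpu_util: float, gpu_mem_pct: float,
                    cpu_util: float, disk_used_pct: float) -> str:
    """Turn raw utilization percentages into a one-line assessment."""
    if disk_used_pct > 90:
        return "disk nearly full: checkpoint writes may start failing"
    if gpu_mem_pct > 95:
        return "GPU memory nearly exhausted: OOM risk"
    if gpu_util < 30 and cpu_util > 80:
        return "GPU starved: likely a data-loading / CPU bottleneck"
    if gpu_util > 80:
        return "GPU-bound: utilization looks healthy"
    return "no obvious bottleneck"
```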
Run scripts/summarize_log.py in one of these modes:
python scripts/summarize_log.py --config config.json --action read --tail 200
python scripts/summarize_log.py --config config.json --action detect-failure --tail 400
python scripts/summarize_log.py --config config.json --action summarize --tail 400
Use read for recent excerpts and metrics, detect-failure for exception diagnosis, and summarize for a concise human-facing assessment with next steps.
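Failure detection in tools like this is typically pattern matching over the log tail. A minimal sketch, with illustrative patterns rather than the set log_utils.py actually uses:

```python
import re

# Illustrative patterns only; scripts/log_utils.py defines the real set.
FAILURE_PATTERNS = [
    (re.compile(r"CUDA out of memory"), "oom"),
    (re.compile(r"Traceback \(most recent call last\)"), "python-exception"),
    (re.compile(r"loss.*\bnan\b", re.IGNORECASE), "nan-loss"),
    (re.compile(r"Connection (reset|refused|timed out)"), "network"),
]

def detect_failure(tail_lines: list[str]):
    """Scan newest-first so the most recent evidence wins."""
    for line in reversed(tail_lines):
        for pattern, label in FAILURE_PATTERNS:
            if pattern.search(line):
                return label, line.strip()
    return None, None
```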
Scripts:

- scripts/remote_train.py: start training, optional resume templating, structured launch result.
- scripts/check_status.py: process/GPU/log-based training status.
- scripts/monitor_resources.py: GPU/CPU/memory/disk summary and bottleneck hints.
- scripts/summarize_log.py: read logs, detect failures, summarize convergence and next actions.
- scripts/common.py: shared config loading, SSH execution, safe path checks, remote helpers.
- scripts/log_utils.py: reusable log parsing, failure detection, trend analysis, recommendation logic.

References:

- references/usage.md for setup steps, example configs, and example commands.
- references/troubleshooting.md when SSH, environment activation, logs, or training recovery fail.

Typical workflow:

- Run scripts/check_status.py before reading a long log.
- When a run looks dead, run scripts/check_status.py and then scripts/summarize_log.py --action detect-failure.
- For progress reports, run scripts/summarize_log.py --action summarize and include the recommendations from the script in the final response.
- To resume, use scripts/remote_train.py --resume-from ... so the resume command is explicit and auditable.