Install

openclaw skills install jarvislabs-gpu

Agent guide for running and monitoring GPU experiments with the jl CLI on JarvisLabs.ai.

jarvislabs-gpu (jl) — Agent Guide

Install the JarvisLabs CLI as a tool:
uv tool install jarvislabs
Alternative Python install:
pip install jarvislabs
Verify auth with jl status --json before doing anything. If not logged in, use jl setup --token <token> --yes. You can also authenticate via export JL_API_KEY="...".
Use --help on any command to discover flags (e.g., jl run --help, jl create --help). If something goes wrong, use jl run logs, jl run status, and jl exec to diagnose — don't guess.
- jl create/list/get/pause/resume/destroy/rename/ssh/exec/upload/download = GPU instance lifecycle and access.
- jl run = managed job on an instance. Uploads code, sets up a Python environment, runs your script in the background with log tracking.
- jl exec = run any command on an instance. Use for system checks (nvidia-smi, ps, df), debugging failed runs, inspecting files, or any raw shell access. No environment setup, no tracking. This is your escape hatch when jl run doesn't cover your use case.

jl create --gpu L4 --storage 40 --yes --json
jl create --gpu L4 --spot --yes --json
jl create --vm --cpu --yes --json
--gpu is required for GPU instances. Use --spot only for GPU containers, not GPU VMs or CPU VMs. CPU VMs are created with --vm --cpu; omit --vcpus/--ram to use the smallest available CPU plan from the backend. Run jl create --help for all available flags.
Instances have three states that matter: Running (billing active), Paused (compute billing stopped, storage billing continues, data persists), Destroyed (everything deleted).
jl pause <id> --yes --json # stop compute billing, keep data
jl resume <id> --yes --json # restart a paused instance
jl destroy <id> --yes --json # permanently delete
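The three states and the commands that move between them can be summarized as a small transition table. This is an illustration of the billing states described above, not a CLI API; whether invalid transitions are rejected or ignored by the backend is an assumption here.

```python
# Allowed state transitions for a jl instance (illustrative sketch).
TRANSITIONS = {
    ("Running", "pause"): "Paused",       # compute billing stops, storage billing continues
    ("Paused", "resume"): "Running",      # billing resumes
    ("Running", "destroy"): "Destroyed",  # everything deleted
    ("Paused", "destroy"): "Destroyed",
}

def apply(state: str, command: str) -> str:
    """Return the next state, or raise on a transition the table doesn't allow."""
    nxt = TRANSITIONS.get((state, command))
    if nxt is None:
        raise ValueError(f"cannot {command} a {state} instance")
    return nxt
```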
Resume caveats:
- Pass --spot when you want a paused GPU container to resume as spot. Without --spot, resume is on-demand.
- Pass --vcpus and --ram if you want to change CPU size on resume.
- Run jl resume --help for all available flags (GPU swap, storage expansion, rename, etc.).

SSH, exec, upload, and download only work on Running instances.
Valid region codes for new instances: IN2, EU1.
IN1 is winding down. New instances and filesystems can no longer be created in IN1. Existing IN1 instances can still be resumed, paused, destroyed, and renamed; existing IN1 filesystems can still be listed, resized, and removed. Guide users with IN1 resources to the migration doc: https://docs.jarvislabs.ai/in1-migration.
If --region is omitted, the CLI picks a region based on GPU availability.
| Constraint | Detail |
|---|---|
| EU1 | H100 and H200 only, single-GPU launches only right now, 100 GB minimum storage (auto-bumped) |
| VM template | IN2 and EU1 only, requires at least one SSH key, 100 GB minimum storage |
Run jl gpus to check GPU availability and pricing. Output shows GPU Containers and GPU VMs tables with separate availability for each. Spot prices are shown only for GPU containers.
Run jl resources when you also need CPU VM availability and pricing. It shows GPU containers, GPU VMs, and CPU VMs, with one shared available/unavailable legend at the end.
How to read jl gpus --json availability:
- num_free_devices: free GPUs on that server. These can be used for normal creates, and also for spot creates when spot_price is present.
- effective_num_free_devices: GPUs available for on-demand creates on that server, including GPUs currently used by spot instances that can be preempted.
- workload_type tells which launch type the row belongs to:
  - "container" means use it for normal GPU container creates.
  - "vm" means use it for GPU VM creates.
  - null means the same row applies to both containers and VMs.

Container instances expose default HTTP ports (each gets its own HTTPS URL):
| Port | Service |
|---|---|
| 8889 | JupyterLab (url field) |
| 7007 | IDE (vs_url field) |
| 6006 | Available on generic templates like pytorch (endpoints[0]) |
VM instances (jl create --gpu ... --vm) get SSH-only access. VMs require at least one SSH key registered (jl ssh-key add). Use ssh_command from jl get <id> --json.
To expose a service (FastAPI, Gradio, etc.), bind to 0.0.0.0:6006 — it's accessible via endpoints[0] on generic templates. Use --http-ports "7860,8080" at creation or resume to expose custom ports. Custom port URLs appear in endpoints after the default 6006 entry.
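Any server that listens on 0.0.0.0:6006 becomes reachable through the instance's endpoints[0] URL. A minimal standard-library sketch of such a service (a FastAPI or Gradio app just needs the same host and port settings; the health-check handler here is purely illustrative):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *_):  # silence per-request logging
        pass

# Bind to all interfaces on 6006 so the platform can route endpoints[0] to it.
server = HTTPServer(("0.0.0.0", 6006), Health)
threading.Thread(target=server.serve_forever, daemon=True).start()
```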
Run jl get <id> --json to find all service URLs (url, vs_url, endpoints).
How jl run works

jl run uploads your code to an instance, sets up a Python environment, and runs your script in the background with log and exit code tracking. You need either --on <machine_id> (existing instance) or --gpu <type> (creates a fresh instance).
run_id is tracked locally under ~/.jl/runs/. All run management commands (logs, status, stop, list) depend on these local records. Start and monitor runs from the same machine.
| Target | What happens |
|---|---|
train.py | Uploads to <home>/train.py, runs in <home>/ with shared venv at $HOME/.venv |
. or ./project with --script train.py | Rsyncs the directory to <home>/<dirname>/, runs inside it with project venv at <home>/<dirname>/.venv |
No target, command after -- | No upload. Runs from ~. If $HOME/.venv exists (from a previous file run), its bin/ is prepended to PATH so python and pip resolve to venv versions. Otherwise uses system Python. |
Only .py and .sh file targets are supported. For other file types, use a directory target or jl upload + jl exec. Directory targets require rsync installed locally.
Note: File targets with the same basename overwrite each other on the remote (e.g., foo/train.py and bar/train.py both land at /home/train.py). Use directory targets for projects with nested structure.
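The target resolution in the table above can be sketched as a path helper. This mirrors the documented layout only, not the CLI's actual implementation, and assumes the container home /home/ (VMs use /home/<user>/):

```python
import posixpath

def remote_layout(kind: str, name: str, home: str = "/home/"):
    """Where code, venv, and working directory land for each jl run target type.

    kind: "file", "dir", or "command"; name: basename of the target.
    Illustrative sketch of the documented behavior.
    """
    if kind == "file":
        # File targets overwrite by basename: foo/train.py -> /home/train.py
        return {"code": home + name, "venv": home + ".venv", "cwd": home}
    if kind == "dir":
        # Directory targets get an isolated per-project venv
        return {"code": home + name + "/", "venv": home + name + "/.venv", "cwd": home + name + "/"}
    # Command mode: nothing uploaded, runs from ~; shared venv used if present
    return {"code": None, "venv": home + ".venv", "cwd": "~"}
```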
Pass script arguments after --:
jl run train.py --on <id> --json --yes -- --epochs 50 --lr 0.001
jl run manages a Python venv on the remote instance. Template packages (torch, etc.) are inherited via --system-site-packages — no need to install them. Venvs persist under the remote home directory across pause/resume.
Venv locations:
- File targets: $HOME/.venv. All file runs share it — deps installed for one script are available to all.
- Directory targets: <home>/<dirname>/.venv. Isolated per project.
- Command mode: if $HOME/.venv exists from a previous file run, python and pip automatically resolve to it via PATH prepend.

How dependencies get installed:
- Directory targets: if the directory contains requirements.txt or pyproject.toml (with [project]), deps are installed automatically. No flag needed.
- File targets: pass --requirements requirements.txt if you need extra packages.
- --requirements <file> — overrides auto-detection. Uploads and installs the specified file instead.
- --setup <command> — runs a shell command before your script (e.g., --setup "pip install flash-attn"). Runs inside the venv for file/dir targets, raw for command mode.

# Directory — auto-detects requirements.txt
jl run . --script train.py --on <id> --json --yes
# Single file — pass requirements explicitly
jl run train.py --on <id> --requirements requirements.txt --json --yes
# Extra setup command
jl run . --script train.py --on <id> --setup "pip install flash-attn" --json --yes
Command mode — when you pass a raw command after -- with no file or directory target. Useful when code already exists on the instance (e.g., uploaded via jl upload, written via jl exec, or left by a previous run). If $HOME/.venv exists from a prior file run, its bin/ is prepended to PATH so python and pip resolve to venv versions. You still get jl run log tracking (logs, status, stop), which is the main advantage over jl exec. --requirements is not supported in command mode.
Important: Command mode runs from ~ (the remote shell home). Use absolute paths or cd explicitly for scripts in specific directories.
jl run --on <id> --json --yes -- python3 /home/train.py
jl run --on <id> --json --yes -- sh -lc 'cd /home && torchrun --nproc_per_node=2 train.py'
jl run train.py --on <machine_id> --json --yes
jl run . --script train.py --on <machine_id> --requirements requirements.txt --json --yes
Lifecycle flags (--keep, --pause, --destroy) are not allowed with --on — the instance is not touched after the run.
jl run . --script train.py --gpu L4 --keep --json --yes
jl run . --script train.py --gpu L4 --spot --keep --json --yes
Creates a new instance, uploads code, runs the script. Additional flags: --spot (fresh GPU containers only), --vm (VM instead of container, auto-bumps storage to 100GB, disallows --template and --http-ports), --template (default: pytorch; run jl templates --json to list available), --storage (default: 40GB), --num-gpus (default: 1), --region, --http-ports.
Lifecycle rules for fresh instances:
- With --json or --no-follow: --keep is required. The CLI rejects --pause and --destroy because it returns immediately and cannot apply lifecycle actions later. Use --keep and have the agent pause or destroy the instance after the run completes.
- Without --json or --no-follow (human mode): the CLI stays attached, streams logs, and applies lifecycle when the run finishes. Default lifecycle is --pause.

Use separate jl create when you need to inspect GPU availability, reuse machines across runs, or attach filesystems/scripts.
The primary monitoring command:
jl run logs <run_id> --tail 50
Always use --tail N — without it, the entire log file is returned and can be enormous.
The output includes a header and footer with run state (in non-follow, non-JSON mode):
--- run r_abc | machine 123 | running ---
step=100 loss=2.31
step=200 loss=2.11
--- still running | log: /home/jl-runs/r_abc/output.log ---
When done, the footer shows the final state:
--- succeeded | exit code: 0 | log: /home/jl-runs/r_abc/output.log ---
Or on failure:
--- failed | exit code: 1 | log: /home/jl-runs/r_abc/output.log ---
If the instance is paused, missing, or SSH is unavailable, jl run logs fails before printing any output. Use jl run status <run_id> --json to check those states.
1. jl run ... --json --yes — extract run_id and machine_id from the JSON
2. sleep 15 && jl run logs <run_id> --tail 30 — if the footer says failed, fix and retry immediately
3. sleep 120 && jl run logs <run_id> --tail 50
4. still running → repeat step 3
5. succeeded | exit code: 0 → download results
6. failed | exit code: N → read the error, fix, start a new run

Cadence: 60-120s (short experiments), 180-300s (long training), 300-600s (very long runs).
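The decision step of this loop can be sketched as a parser over the footer lines shown earlier. The regex mirrors the example footers exactly; treating that as the stable format is an assumption:

```python
import re

# Footer lines look like:
#   --- still running | log: /home/jl-runs/r_abc/output.log ---
#   --- succeeded | exit code: 0 | log: ... ---
#   --- failed | exit code: 1 | log: ... ---
FOOTER = re.compile(r"^--- (still running|succeeded|failed)(?: \| exit code: (\d+))?")

def next_action(footer_line: str) -> str:
    """Decide the agent's next step from a jl run logs footer line."""
    m = FOOTER.match(footer_line.strip())
    if not m:
        return "unknown: re-check with jl run status --json"
    state, code = m.group(1), m.group(2)
    if state == "still running":
        return "sleep and poll again"
    if state == "succeeded" and code == "0":
        return "download results"
    return f"read error (exit code {code}), fix, start a new run"
```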
jl run status <run_id> --json
Returns run state, machine_id, exit_code, lifecycle_policy, launch_command, and more. Without --refresh, jl run list shows state as "saved" (a sentinel, not a real run state). Use --refresh or --status to get live state.
jl run stop <run_id> --json
Kills the entire process group (training script + all child processes). Escalates to SIGKILL if the process doesn't exit after TERM.
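The TERM-then-KILL escalation can be illustrated locally with a process group. This is a sketch of the described behavior, not the CLI's code; the 5-second grace period is an assumption, not jl's actual timeout:

```python
import os
import signal
import subprocess
import time

def stop_run(proc: subprocess.Popen, grace: float = 5.0) -> None:
    """TERM the whole process group, then escalate to KILL if it doesn't exit."""
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)           # ask the group to exit
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return                            # exited on TERM
        time.sleep(0.1)
    os.killpg(pgid, signal.SIGKILL)           # force-kill stragglers

# start_new_session=True puts the child in its own process group,
# so killpg reaches the script and all of its children.
proc = subprocess.Popen(["sleep", "60"], start_new_session=True)
stop_run(proc)
```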
jl exec <id> -- nvidia-smi
jl exec <id> -- ps -ef
jl exec <id> -- df -h
Prefer raw output for jl exec and jl run logs — easier to read and parse. Use --json when you need machine-readable state: create, get, list, run start, run status.
Exit code of the remote command is propagated. For pipes or shell syntax, wrap in sh -lc:
jl exec <id> -- sh -lc 'grep "loss" /path/to/log | tail -5'
jl upload <id> ./local /remote # upload file or directory
jl download <id> /remote ./local # download file
jl download <id> /remote ./local -r # download directory
Default destinations: upload without dest → remote home directory. Download without dest → ./<basename> in current local directory.
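Those defaults can be sketched with two hypothetical helpers (names and signatures are illustrative, mirroring the stated rules; container home /home/ assumed):

```python
import posixpath

def upload_dest(local, remote=None, home="/home/"):
    """jl upload without a dest: the file lands in the remote home directory."""
    return remote if remote else home + posixpath.basename(local.rstrip("/"))

def download_dest(remote, local=None):
    """jl download without a dest: ./<basename> in the current local directory."""
    return local if local else "./" + posixpath.basename(remote.rstrip("/"))
```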
The remote home directory (/home/ on containers, /home/<user>/ on VMs) persists. Everything else is ephemeral.
Persists:
- $HOME/.venv (shared venv for file runs) and <project>/.venv (per-project venv for directory runs)
- Attached filesystems (/home/jl_fs/)
- Run records under <home>/jl-runs/<run_id>/

Lost on pause:
- System packages (apt-get, global pip packages outside the home directory)
- Anything outside the home directory (/tmp, /root, etc.)

Use --setup for system-level reinstalls (e.g., apt-get). Python packages in the venv persist across pause/resume. For recurring system setup, use startup scripts (jl scripts add).
<home> is /home/ on containers, /home/<user>/ on VMs.
- File targets (jl run): <home>/<filename> (e.g., train.py → /home/train.py)
- Directory targets (jl run): <home>/<directory_name>/
- Uploads (jl upload): <home>/<filename>
- Shared venv: <home>/.venv/
- Project venvs: <home>/<directory_name>/.venv/
- Run records: <home>/jl-runs/<run_id>/

Attach a filesystem at creation with --fs-id <id>. Attach a startup script with --script-id <id> (and --script-args). These flags work on both jl create and jl resume.
jl templates --json # list available templates
jl ssh-key list --json # list registered SSH keys
jl ssh-key add <pubkey-file> --name x # add SSH key (required for VMs)
jl scripts list --json # list startup scripts
jl filesystem list --json # list filesystems
jl filesystem create --name x --storage 100 --json # create filesystem
Filesystem caveats:
- Resizing (jl filesystem edit) may return a new fs_id. Always use the returned ID.
- The CLI validates that fs_id exists before creating/resuming, but does not validate region match. Ensure they match yourself.

# 1. Check GPUs and create instance
jl gpus --json
jl create --gpu L4 --storage 50 --yes --json
# 2. Start detached run
jl run . --script train.py --on <machine_id> --requirements requirements.txt --json --yes
# 3. Early check (catch import/syntax/pip failures fast)
sleep 15 && jl run logs <run_id> --tail 30
# 4. Steady-state monitoring (repeat until footer shows succeeded or failed)
sleep 120 && jl run logs <run_id> --tail 50
# 5. Download results (use /home/<user>/ for VMs instead of /home/)
jl download <machine_id> /home/results ./results -r
# 6. Cleanup
jl pause <machine_id> --yes --json
For fresh instances without a pre-created instance:
# Creates instance inline, runs detached — agent must clean up after
jl run . --script train.py --gpu L4 --keep --json --yes
# ... monitor with jl run logs ...
jl pause <machine_id> --yes --json
When --json is active, CLI validation and API failures are emitted as {"error": "..."} to stdout.
Not all non-zero exits use that shape. jl exec --json returns its own structured payload with stdout, stderr, and exit_code fields.
Agent rule:
- If the JSON contains an error key, treat it as a CLI failure
- Otherwise inspect the run fields (exit_code, state, run_exit_code)

Pitfalls:
- Don't use jl run logs --follow — it blocks forever and will timeout. --json is also incompatible with --follow.
- Always pass --json when starting runs — it returns immediately. Without --json, the CLI streams logs and blocks.
- Always pass --tail N — omitting it can return megabytes of output.
- Don't pass lifecycle flags (--keep, --pause, --destroy) with --on — they are rejected. Only for fresh instances.
- Don't use --pause or --destroy with --json for fresh instances — rejected. Use --keep --json and clean up yourself.
- Don't use jl exec for long-running tasks — it blocks until the command finishes. Use jl run, which runs in the background with log tracking.
- Don't trust jl run list without --refresh — state shows as "saved" (stale). Use --refresh or --status for live state.
- Don't assume machine_id is stable after jl resume — it may return a new ID. Always use the returned ID.

Every command supports --help for full flag details:
jl create --help jl run --help jl ssh-key --help
jl resume --help jl run logs --help jl filesystem --help
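The agent rule for --json output can be sketched as a small helper. It assumes only the documented shapes: errors arrive as {"error": "..."} on stdout, and normal results carry fields like exit_code and state:

```python
import json

def classify(payload: str):
    """Apply the agent rule to a --json result string.

    Returns ("cli_error", message) when the CLI emitted {"error": ...},
    otherwise ("result", data) for normal field inspection.
    """
    data = json.loads(payload)
    if isinstance(data, dict) and "error" in data:
        return ("cli_error", data["error"])
    return ("result", data)
```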