Install
openclaw skills install @zyyhhxx/sagemaker-training-job

Submit ML training jobs to AWS SageMaker from the command line: package code, upload to S3, launch on GPU/CPU instances, poll status, and download artifacts. Use when training machine learning models that need more compute than the local machine (GPU training, large datasets, parallel experiments). Supports PyTorch, TensorFlow, scikit-learn, and XGBoost/LightGBM, with managed spot training for cost savings.

Triggers on "train on SageMaker", "GPU training", "submit training job", "cloud training", "SageMaker", "remote training".
- boto3 Python package installed (pip install boto3). sagemaker recommended.
- AWS credentials configured via aws configure or env vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY).
- See references/setup.md for the exact policies required.
- Packaging excludes .git, .env, venv, __pycache__, and other non-essential files. Use --source-dir to explicitly scope what gets packaged.
- Always review --dry-run output before submitting to production.
- See references/setup.md.

Follow the SageMaker training script contract: read data from the directory named by SM_CHANNEL_TRAIN and save the model to the directory named by SM_MODEL_DIR. See references/training-scripts.md for templates.
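
As an illustration of the contract above, here is a minimal sketch of a training entry point. It is not one of the templates from references/training-scripts.md; the function name `train`, the placeholder "model", and the `model.json` file name are assumptions made for this example. Only the two environment variables come from the contract itself.

```python
import json
import os
from pathlib import Path


def train(train_dir=None, model_dir=None):
    """Toy 'training' run that follows the SageMaker directory contract."""
    # SageMaker sets these variables inside the training container.
    # The fallbacks are the container's conventional paths.
    train_dir = Path(train_dir or os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    model_dir = Path(model_dir or os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))

    # Placeholder "model" (an assumption for this sketch): count the input files.
    files = sorted(p for p in train_dir.rglob("*") if p.is_file())
    model = {"n_files": len(files)}

    # Anything written under the model directory is archived by SageMaker
    # into model.tar.gz when the job finishes.
    model_dir.mkdir(parents=True, exist_ok=True)
    out_path = model_dir / "model.json"
    out_path.write_text(json.dumps(model))
    return out_path
```

In a real train.py, the file would end with a call to `train()`, and the placeholder would be replaced by actual model fitting and serialization.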
python3 scripts/sagemaker_train.py \
--job-name my-experiment-001 \
--script ./train.py \
--role arn:aws:iam::ACCOUNT:role/SageMakerRole \
--bucket my-sagemaker-bucket \
--instance-type ml.g5.xlarge \
--spot \
--framework pytorch \
--input-data s3://my-bucket/data/train/ \
--hyperparameters '{"epochs":"50","lr":"0.001"}' \
--output-dir ./results
The script packages your code, uploads to S3, submits the job, polls until
complete, and downloads model artifacts to --output-dir.
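
The packaging step can be pictured with the sketch below. It is an assumption about how such a step might work, not the actual code in sagemaker_train.py: the function name `package_source` and the exact exclusion set are hypothetical, and the set only covers the names this document mentions.

```python
import tarfile
from pathlib import Path

# Assumed exclusion list, limited to the names mentioned in this document.
EXCLUDED = {".git", ".env", "venv", "__pycache__"}


def package_source(source_dir, archive_path, excluded=EXCLUDED):
    """Write source_dir into a .tar.gz, skipping excluded names; return member names."""
    source_dir = Path(source_dir)

    def keep(info):
        # Drop the entry if any path component is on the exclusion list.
        # Returning None for a directory also skips everything beneath it.
        return None if set(Path(info.name).parts) & set(excluded) else info

    with tarfile.open(str(archive_path), "w:gz") as tar:
        for child in sorted(source_dir.iterdir()):
            tar.add(str(child), arcname=child.name, filter=keep)

    # Re-open the archive to report what was actually packaged.
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return tar.getnames()
```

Listing the returned member names before uploading serves the same purpose as reviewing --dry-run output: it shows exactly which files would leave the machine.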
# Estimate before running
python3 scripts/sagemaker_cost.py --instance-type ml.g5.xlarge --duration 3600 --spot
# Check actual cost after job completes
python3 scripts/sagemaker_cost.py --job-name my-experiment-001
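
The arithmetic behind a pre-run estimate is simple enough to sketch. The function below is a hypothetical illustration, not the logic of sagemaker_cost.py: the hourly rates are the ones quoted in this document, real prices vary by region, and the spot discount is an input you must supply because actual spot savings fluctuate.

```python
# Hourly on-demand rates as quoted in this document; real prices vary by region.
HOURLY_USD = {
    "ml.m5.2xlarge": 0.54,
    "ml.g4dn.xlarge": 0.74,
    "ml.g5.xlarge": 1.41,
    "ml.p3.2xlarge": 4.28,
}


def estimate_cost(instance_type, duration_seconds, spot_discount=0.0, instance_count=1):
    """Estimated cost in USD; spot_discount is a fraction in [0, 1)."""
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    hours = duration_seconds / 3600
    return HOURLY_USD[instance_type] * hours * instance_count * (1.0 - spot_discount)
```

For example, `estimate_cost("ml.g5.xlarge", 3600, spot_discount=0.7)` comes to roughly $0.42, against $1.41 on demand for the same hour.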
python3 scripts/sagemaker_list.py --max 5
python3 scripts/sagemaker_list.py --status Failed
| Flag | Purpose | Default |
|---|---|---|
| --spot | Managed spot training (up to 70% savings) | off |
| --instance-type | Compute instance | ml.g5.xlarge |
| --max-runtime | Kill job after N seconds | 3600 |
| --framework | pytorch, tensorflow, sklearn, xgboost | pytorch |
| --image-uri | Custom Docker image (overrides framework) | auto |
| --requirements | requirements.txt for extra deps | none |
| --dry-run | Print config without submitting | off |
| --no-wait | Submit and exit without polling | off |
| --resume JOB | Reconnect to a running/completed job (skip submission) | - |
| --source-dir | Directory with all training code | script's parent |
| --input-data | S3 input(s), format: channel:s3://... | none |
| --env | JSON environment variables | {} |
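
The channel:s3://... format for --input-data could be parsed along the following lines. This is a sketch under assumptions: the function name `parse_input_data` is hypothetical, and treating a bare s3:// URI as the `train` channel is a guess made for this example, not documented behavior of the script. SageMaker exposes each channel to the container as an SM_CHANNEL_<NAME> variable, which is why the `train` channel corresponds to SM_CHANNEL_TRAIN.

```python
def parse_input_data(spec, default_channel="train"):
    """Split 'channel:s3://bucket/prefix' into (channel, s3_uri)."""
    # Assumption: a bare S3 URI is assigned to the default channel.
    if spec.startswith("s3://"):
        return default_channel, spec
    channel, sep, uri = spec.partition(":")
    if not sep or not channel or not uri.startswith("s3://"):
        raise ValueError(f"expected 'channel:s3://...' or 's3://...', got {spec!r}")
    return channel, uri
```

With this sketch, `parse_input_data("validation:s3://my-bucket/data/val/")` returns `("validation", "s3://my-bucket/data/val/")`.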
For tabular/Kaggle workloads:
- ml.m5.2xlarge (CPU, $0.54/hr)
- ml.g4dn.xlarge (T4, $0.74/hr): cheapest GPU
- ml.g5.xlarge (A10G, $1.41/hr): best price/performance
- ml.p3.2xlarge (V100, $4.28/hr)

Always use --spot for non-urgent training; typical savings are 30-70%.
For autonomous agents running training jobs in a loop:
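1. Write the training script using the templates in references/training-scripts.md.
2. Run with --dry-run first to validate the config.
3. Run sagemaker_train.py; it blocks until completion by default.
4. Read the results from --output-dir.

For parallel experiments, use --no-wait and poll with sagemaker_list.py.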
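
For an agent that prefers to poll from Python rather than shell out to sagemaker_list.py, the waiting loop might look like the sketch below. The function `wait_for_jobs` and its parameters are hypothetical. The status lookup is injected as a callable so the loop can be exercised without AWS access.

```python
import time

# Terminal values of SageMaker's TrainingJobStatus field.
TERMINAL = {"Completed", "Failed", "Stopped"}


def wait_for_jobs(job_names, get_status, interval_seconds=30.0, sleep=time.sleep):
    """Poll until every job reaches a terminal status; return {job_name: status}."""
    final = {}
    pending = list(job_names)
    while pending:
        still_running = []
        for name in pending:
            status = get_status(name)
            if status in TERMINAL:
                final[name] = status
            else:
                still_running.append(name)
        pending = still_running
        if pending:
            sleep(interval_seconds)
    return final


# With boto3, get_status could be written as:
#   client = boto3.client("sagemaker")
#   get_status = lambda n: client.describe_training_job(TrainingJobName=n)["TrainingJobStatus"]
```

Submitting each experiment with --no-wait and then calling `wait_for_jobs` once keeps the agent from blocking on one job at a time.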
Verify the entire pipeline works end-to-end (~$0.01, takes ~3 min):
python3 scripts/sagemaker_smoke_test.py \
--role arn:aws:iam::ACCOUNT:role/SageMakerTrainingExecutionRole \
--bucket my-sagemaker-bucket
This runs a local pre-flight, submits a minimal job to SageMaker, verifies
the downloaded model artifact, and checks cost. Use --keep to preserve output files.