SageMaker Training Job

v1.0.2

Submit ML training jobs to AWS SageMaker — package code, upload to S3, launch on GPU/CPU instances, poll status, download artifacts. Use when training machin...


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for zyyhhxx/sagemaker-training-job.

Prompt Preview: Install & Setup
Install the skill "SageMaker Training Job" (zyyhhxx/sagemaker-training-job) from ClawHub.
Skill page: https://clawhub.ai/zyyhhxx/sagemaker-training-job
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Required binaries: python3
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install sagemaker-training-job

ClawHub CLI


npx clawhub@latest install sagemaker-training-job
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name/description align with included scripts and docs: packaging/uploading source, submitting SageMaker jobs, polling status, downloading artifacts, listing jobs, and cost estimation. The included Python scripts implement the advertised functionality and the reference docs describe required IAM roles and S3 setup.
Instruction Scope
SKILL.md and references limit actions to packaging source, calling AWS (boto3) APIs, and running local smoke tests. The instructions require AWS credentials (via standard boto3 chain) and reference only expected paths and endpoints (S3, SageMaker, CloudWatch). There are no instructions to read unrelated system files or exfiltrate data to unexpected endpoints. The smoke test and packaging steps do create temporary files and upload to S3 as expected.
Install Mechanism
No install spec (instruction-only) — scripts rely on python3 and Python packages (boto3, optional sagemaker). Nothing is downloaded from arbitrary URLs or extracted; the skill ships its Python scripts and docs. This is a low-risk install pattern for this kind of tool.
Credentials
Primary declared variable is AWS_DEFAULT_REGION (region), and the skill relies on boto3's normal credential chain (instance profile, AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY, or configured profile). The SKILL.md and references clearly explain the need for AWS credentials and specific IAM roles. It would be clearer if requires.env explicitly listed the possible credential env vars, but the current setup (using the standard boto3 chain and recommending instance profiles) is proportionate to the skill's purpose.
Persistence & Privilege
The skill is not always-enabled and allows user invocation. It does not request system-wide privileges or modify other skills. It performs normal actions (create S3 objects, call SageMaker APIs) with the provided AWS permissions; these are expected for the stated purpose.
Assessment
This skill appears to do exactly what it claims. Before installing or running it:

1. Be prepared to provide AWS credentials (prefer an EC2 instance profile or a scoped IAM user) and create two IAM roles with the least privilege needed for S3 and SageMaker as described.
2. Review the source packaging --dry-run output to avoid unintentionally uploading secrets (don't point --source-dir at your home directory).
3. Run the smoke test in a controlled account/bucket to verify behavior and cost (it will submit a real SageMaker job and incur charges).
4. Ensure the Caller role is tightly scoped to your S3 bucket and the PassRole action is limited to the specific SageMaker execution role ARN.

If you want stricter metadata, ask the maintainer to declare the AWS credential env vars explicitly in requires.env so the platform makes credential needs clearer.

Like a lobster shell, security has layers — review code before you run it.

Runtime requirements

Bins: python3
Primary env: AWS_DEFAULT_REGION
Latest: vk97dhtj2h9dd1dkh4dzc4nj6xn848hwt
Downloads: 96 · Stars: 1 · Versions: 3
Updated 3w ago
v1.0.2
MIT-0

SageMaker Training

Submit ML training jobs to AWS SageMaker from the command line. Supports PyTorch, TensorFlow, scikit-learn, and XGBoost with managed spot training for cost savings.

Prerequisites

  • boto3 Python package installed (pip install boto3); the sagemaker package is recommended but optional
  • AWS credentials available — EC2 instance profile (recommended), or aws configure / env vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  • S3 bucket for training artifacts
  • Two IAM roles configured — see references/setup.md for exact policies:
    • Role A (Caller): SageMaker job management + S3 access + ECR image pull
    • Role B (Execution): S3 data access + CloudWatch logs + ECR images

Security Notes

  • AWS credentials are never logged, embedded in scripts, or uploaded to S3. boto3 resolves credentials from the standard chain (environment variables → shared credentials/config file → instance profile).
  • Source packaging excludes .git, .env, venv, __pycache__, and other non-essential files. Use --source-dir to explicitly scope what gets packaged. Always review --dry-run output before submitting to production.
  • IAM scope: Both caller and execution role policies should be scoped to your specific S3 bucket and SageMaker execution role ARN. See references/setup.md.
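The exclusion rules above can be sketched with a stdlib tar filter. This is a minimal illustration, not the skill's actual packaging code (which lives in scripts/sagemaker_train.py); the exclusion set mirrors the patterns listed here.

```python
import tarfile
from pathlib import Path

# Hypothetical exclusion set mirroring the patterns this page lists;
# the real list may be longer.
EXCLUDE = {".git", ".env", "venv", "__pycache__"}

def tar_filter(info: tarfile.TarInfo):
    """Drop any member whose path contains an excluded component."""
    if any(part in EXCLUDE for part in Path(info.name).parts):
        return None  # returning None skips this member entirely
    return info

def package_source(source_dir: str, out_path: str = "source.tar.gz") -> str:
    """Tar up source_dir, silently omitting excluded files and directories."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(source_dir, arcname=".", filter=tar_filter)
    return out_path
```

Reviewing the --dry-run output remains the authoritative check, since a filter like this only catches known patterns, not arbitrary secrets.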

Quick Start

1. Write a training script

Follow the SageMaker training script contract: read data from SM_CHANNEL_TRAIN, save model to SM_MODEL_DIR. See references/training-scripts.md for templates.
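A minimal sketch of the contract, assuming the standard SageMaker container environment variables (SM_CHANNEL_TRAIN, SM_MODEL_DIR); the CSV layout, the "target" column name, and the trivial mean "model" are illustrative only — see references/training-scripts.md for real templates.

```python
import csv
import json
import os
import pickle
import statistics

def main():
    # SageMaker injects these paths inside the training container;
    # the defaults below are the conventional container locations.
    train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
    model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    hyperparams = json.loads(os.environ.get("SM_HPS", "{}"))

    # "Training": compute the mean of an illustrative target column.
    # A real script would fit PyTorch/XGBoost/etc. here, using hyperparams.
    values = []
    with open(os.path.join(train_dir, "train.csv")) as f:
        for row in csv.DictReader(f):
            values.append(float(row["target"]))
    model = {"mean": statistics.mean(values), "hyperparams": hyperparams}

    # Anything written to SM_MODEL_DIR is tarred and uploaded as the
    # job's model artifact when the container exits successfully.
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.pkl"), "wb") as f:
        pickle.dump(model, f)

# In the container this runs as: python train.py -> main()
```

The key invariant: read inputs only from the channel directories, write the model only to SM_MODEL_DIR, and exit zero on success.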

2. Submit a training job

python3 scripts/sagemaker_train.py \
  --job-name my-experiment-001 \
  --script ./train.py \
  --role arn:aws:iam::ACCOUNT:role/SageMakerRole \
  --bucket my-sagemaker-bucket \
  --instance-type ml.g5.xlarge \
  --spot \
  --framework pytorch \
  --input-data s3://my-bucket/data/train/ \
  --hyperparameters '{"epochs":"50","lr":"0.001"}' \
  --output-dir ./results

The script packages your code, uploads to S3, submits the job, polls until complete, and downloads model artifacts to --output-dir.
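Under the hood, a submission like this boils down to a SageMaker CreateTrainingJob API call. The sketch below builds a plausible request body using real field names from that API; the exact values (image URI, volume size, S3 prefix) are assumptions, and the skill's actual request construction lives in scripts/sagemaker_train.py.

```python
def build_training_request(job_name: str, role_arn: str, bucket: str,
                           instance_type: str, image_uri: str,
                           spot: bool = False, max_runtime: int = 3600) -> dict:
    """Assemble a CreateTrainingJob request dict (field names per the
    SageMaker API; defaults here are illustrative)."""
    req = {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,  # the execution role SageMaker assumes
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/output/"},
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime},
    }
    if spot:
        # Managed spot requires a max wait time >= the max runtime.
        req["EnableManagedSpotTraining"] = True
        req["StoppingCondition"]["MaxWaitTimeInSeconds"] = max_runtime
    return req

# Submitting is then roughly:
#   boto3.client("sagemaker").create_training_job(**req)
```

Seeing the request shape helps when reading --dry-run output or debugging a rejected submission.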

3. Check cost

# Estimate before running
python3 scripts/sagemaker_cost.py --instance-type ml.g5.xlarge --duration 3600 --spot

# Check actual cost after job completes
python3 scripts/sagemaker_cost.py --job-name my-experiment-001
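The estimate is simple arithmetic: hourly rate × duration, with a discount applied for spot. A back-of-envelope sketch, using the on-demand rates quoted later on this page (which may be out of date) and an assumed 70% spot discount as the upper bound:

```python
# Illustrative on-demand prices (USD/hr) from this page's instance guidance;
# check current AWS pricing before relying on them.
ON_DEMAND = {
    "ml.m5.2xlarge": 0.54,
    "ml.g4dn.xlarge": 0.74,
    "ml.g5.xlarge": 1.41,
    "ml.p3.2xlarge": 4.28,
}

def estimate_cost(instance_type: str, duration_s: int,
                  spot: bool = False, spot_discount: float = 0.7) -> float:
    """Rough job cost in USD; spot_discount=0.7 is the optimistic end of
    the 30-70% savings range, not a guarantee."""
    cost = ON_DEMAND[instance_type] * duration_s / 3600
    if spot:
        cost *= 1 - spot_discount
    return round(cost, 4)
```

The real sagemaker_cost.py presumably also reads actual billed seconds from the completed job description, which a static table like this cannot do.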

4. List recent jobs

python3 scripts/sagemaker_list.py --max 5
python3 scripts/sagemaker_list.py --status Failed

Key Options

Flag             Purpose                                                 Default
--spot           Managed spot training (up to 70% savings)               off
--instance-type  Compute instance                                        ml.g5.xlarge
--max-runtime    Kill job after N seconds                                3600
--framework      pytorch, tensorflow, sklearn, xgboost                   pytorch
--image-uri      Custom Docker image (overrides framework)               auto
--requirements   requirements.txt for extra deps                         none
--dry-run        Print config without submitting                         off
--no-wait        Submit and exit without polling                         off
--resume JOB     Reconnect to a running/completed job (skip submission)  none
--source-dir     Directory with all training code                        script's parent
--input-data     S3 input(s), format: channel:s3://...                   none
--env            JSON environment variables                              {}
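The channel:s3://... format for --input-data maps a named SageMaker input channel (exposed in the container as SM_CHANNEL_&lt;NAME&gt;) to an S3 prefix. A plausible parser for that format — an assumption about how the script interprets the flag, including the "train" default when no channel prefix is given:

```python
def parse_input_data(spec: str) -> tuple[str, str]:
    """Split 'channel:s3://...' into (channel, uri).

    A bare 's3://...' with no channel prefix is assumed to default to
    the 'train' channel.
    """
    if ":" in spec and not spec.startswith("s3://"):
        channel, uri = spec.split(":", 1)
        return channel, uri
    return "train", spec
```

So `--input-data validation:s3://my-bucket/data/val/` would surface inside the container as SM_CHANNEL_VALIDATION.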

Instance Selection

For tabular/Kaggle workloads:

  • Gradient boosting (LightGBM/XGBoost): ml.m5.2xlarge (CPU, $0.54/hr)
  • Small neural nets: ml.g4dn.xlarge (T4, $0.74/hr) — cheapest GPU
  • Standard deep learning: ml.g5.xlarge (A10G, $1.41/hr) — best price/performance
  • Heavy training: ml.p3.2xlarge (V100, $4.28/hr)

Always use --spot for non-urgent training — typical savings of 30-70%.

Workflow Integration

For autonomous agents running training jobs in a loop:

  1. Prepare data locally or upload to S3
  2. Write training script following the contract in references/training-scripts.md
  3. Use --dry-run first to validate config
  4. Submit with sagemaker_train.py — it blocks until completion by default
  5. Results download automatically to --output-dir
  6. Parse metrics from the output for experiment tracking

For parallel experiments, use --no-wait and poll with sagemaker_list.py.
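A minimal polling loop for the parallel case might look like the sketch below. The describe callable is injected so the loop is testable; in practice it would wrap the real SageMaker API, e.g. `boto3.client("sagemaker").describe_training_job(TrainingJobName=name)["TrainingJobStatus"]`. This is an illustration of the pattern, not the code inside sagemaker_list.py.

```python
import time

# Terminal statuses per the SageMaker TrainingJobStatus field.
TERMINAL = {"Completed", "Failed", "Stopped"}

def wait_for_jobs(job_names, describe, interval: float = 30.0) -> dict:
    """Poll describe(name) -> status until every job is terminal.

    Returns {job_name: final_status}. `describe` is any callable
    returning a TrainingJobStatus string.
    """
    pending = set(job_names)
    results = {}
    while pending:
        for name in list(pending):
            status = describe(name)
            if status in TERMINAL:
                results[name] = status
                pending.discard(name)
        if pending:
            time.sleep(interval)  # avoid hammering the API
    return results
```

With --no-wait you submit N jobs up front, then hand all N names to a loop like this and branch on the Failed ones.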

Smoke Test

Verify the entire pipeline works end-to-end (~$0.01, takes ~3 min):

python3 scripts/sagemaker_smoke_test.py \
  --role arn:aws:iam::ACCOUNT:role/SageMakerTrainingExecutionRole \
  --bucket my-sagemaker-bucket

This runs a local pre-flight, submits a minimal job to SageMaker, verifies the downloaded model artifact, and checks cost. Use --keep to preserve output files.
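The local pre-flight step can be approximated by cheap format checks before any AWS call is made. The checks below are illustrative guesses at what a pre-flight might validate (the real ones live in scripts/sagemaker_smoke_test.py); the regexes encode the standard IAM role ARN shape and S3 bucket naming rules.

```python
import re

# 12-digit account ID, then a role path/name of IAM-legal characters.
ROLE_ARN_RE = re.compile(r"^arn:aws:iam::\d{12}:role/[\w+=,.@/-]+$")
# Simplified S3 bucket rule: 3-63 chars, lowercase/digits/dots/hyphens,
# starting and ending with a letter or digit.
BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

def preflight(role_arn: str, bucket: str) -> list[str]:
    """Return a list of problems; empty means the cheap checks passed."""
    errors = []
    if not ROLE_ARN_RE.match(role_arn):
        errors.append(f"malformed role ARN: {role_arn}")
    if not BUCKET_RE.match(bucket):
        errors.append(f"invalid bucket name: {bucket}")
    return errors
```

Checks like these fail in milliseconds instead of after a job submission, which matters when each failed submission still costs a round trip to AWS.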
