nanogpt-training

v0.1.0

Train GPT-2 scale models (~124M parameters) efficiently on a single GPU. Covers GPT-124M architecture, tokenized dataset loading (e.g., HuggingFace Hub shard...


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for lnj22/mhc-layer-impl-nanogpt-training.

Previewing Install & Setup.
Prompt PreviewInstall & Setup
Install the skill "nanogpt-training" (lnj22/mhc-layer-impl-nanogpt-training) from ClawHub.
Skill page: https://clawhub.ai/lnj22/mhc-layer-impl-nanogpt-training
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install mhc-layer-impl-nanogpt-training

ClawHub CLI


npx clawhub@latest install mhc-layer-impl-nanogpt-training
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The name/description (nanogpt training) match the contents: model architecture, tokenized dataset loading, optimizers, and a training loop. Required tooling referenced in SKILL.md (torch, huggingface_hub, einops, numpy) is exactly what you'd expect for this task.
Instruction Scope
Runtime instructions stay on-topic: they show how to download public HF token shards, build datasets via memmap, construct the model, and run mixed-precision training. There are no instructions to read unrelated system files, harvest environment variables, or call endpoints outside the expected external services (HuggingFace/GitHub).
Install Mechanism
This is instruction-only (no install spec). The SKILL.md suggests pip installing common ML packages; that's appropriate and proportional. No archives or remote executables are fetched beyond public Python packages and dataset files from HuggingFace.
Credentials
No environment variables, credentials, or config paths are required. The dataset downloads reference public repos (no auth). If you later point it at a private HF repo, HF credentials would be needed — but the skill itself does not request them.
Persistence & Privilege
The always flag is false, and the skill does not request any special persistent privileges or modifications to other skills. Autonomous invocation is allowed (platform default) but not combined with problematic privileges.
Assessment
This skill is a coherent, textual training guide that appears safe to inspect and use. Before running: (1) review dataset licenses (downloading large token shards can have legal/ethical implications); (2) run initial experiments on a tiny subset to validate code and resource usage; (3) be aware of resource/cost implications when using GPU clouds (Modal examples request A100); (4) only provide HF/GitHub credentials if you intentionally access private repos; and (5) if you plan to execute code from untrusted sources, do so in isolated environments (containers) and inspect code snippets carefully for any modifications before running.

Like a lobster shell, security has layers — review code before you run it.

Latest: vk97fgpw8ap5jgqmv349mabmfph84tgr9
76 downloads
0 stars
1 version
Updated 1w ago
v0.1.0
MIT-0 license

NanoGPT Training

Overview

This skill trains GPT-2 scale models (~124M parameters) efficiently on a single GPU. It provides:

  • GPT-124M Architecture: Standard transformer with RoPE and modern optimizations
  • Tokenized Datasets: Loading pre-tokenized shards from HuggingFace Hub or local files (see the loading sketch after this list)
  • Modern Optimizers: Muon optimizer with Newton-Schulz orthogonalization (see the orthogonalization sketch after this list)
  • Mixed Precision: bfloat16 training on A100 for 2x speedup
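
A minimal sketch of the tokenized-shard loading mentioned above, assuming shards are flat arrays of uint16 token ids as in typical nanoGPT-style pipelines; the repo id and file name in the commented usage are hypothetical placeholders.

import numpy as np
import torch
from huggingface_hub import hf_hub_download

def load_shard(path):
    # memory-map a shard of pre-tokenized data instead of reading it all into RAM
    return np.memmap(path, dtype=np.uint16, mode="r")

def get_batch(tokens, batch_size, block_size, device="cuda"):
    # sample random contiguous windows; targets are the inputs shifted by one token
    ix = np.random.randint(0, len(tokens) - block_size - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(tokens[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(tokens[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)

# hypothetical usage: fetch one public shard from the Hub, then sample a batch
# path = hf_hub_download(repo_id="<user>/<tokenized-dataset>", filename="shard_000.bin", repo_type="dataset")
# tokens = load_shard(path)
# x, y = get_batch(tokens, batch_size=8, block_size=1024)

The Muon bullet refers to Newton-Schulz orthogonalization; a minimal sketch of that step follows, using the quintic coefficients commonly seen in public Muon implementations (the skill's reference code may differ).

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # approximately orthogonalize a 2D update matrix via a quintic Newton-Schulz iteration
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)        # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:                  # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X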

Training options:

  • Baseline GPT: Standard residual connections
  • Experimental residual variants: Optional alternative residual schemes for stability/efficiency (a generic illustration follows this list)
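
One common flavor of such residual variants is a learnable per-channel gain on the branch output; the class below is a generic illustration of that idea, not necessarily the variant this skill implements.

import torch
import torch.nn as nn

class ScaledResidual(nn.Module):
    """Residual connection with a learnable per-channel gain on the branch output."""

    def __init__(self, dim, init=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.full((dim,), float(init)))

    def forward(self, x, branch_out):
        # x: residual stream, branch_out: attention/MLP output of matching shape
        return x + self.scale * branch_out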

Quick Reference

Topic               Reference
Model Architecture  GPT Architecture
Data Loading        Tokenized Data
Optimizers          Optimizers
Training Loop       Training Loop
Hyperparameters     Hyperparameters

Installation

pip install torch einops numpy huggingface_hub
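
A quick check that the install picked up a CUDA-enabled torch build with bfloat16 support (expected values in the comments assume an A100 machine):

import torch
print(torch.__version__)
print(torch.cuda.is_available())       # True on a working GPU setup
print(torch.cuda.is_bf16_supported())  # True on A100; needed for bfloat16 training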

Minimal Example

import modal

app = modal.App("gpt-training")

image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "einops", "numpy", "huggingface_hub"
)

@app.function(gpu="A100", image=image, timeout=3600)
def train():
    import torch
    from dataclasses import dataclass

    @dataclass
    class GPTConfig:
        block_size: int = 1024
        vocab_size: int = 50257
        n_layer: int = 12
        n_head: int = 12
        n_embd: int = 768
        dropout: float = 0.0
        bias: bool = False

    config = GPTConfig()

    # Download data, build model, train
    # ... (see references for full implementation)
    final_loss = None  # placeholder; the full training loop sets this to the real value

    return {"final_loss": final_loss}

@app.local_entrypoint()
def main():
    results = train.remote()
    print(results)
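
To launch the example remotely, install and authenticate the Modal client once, then run the script with the Modal CLI; the file name below is illustrative.

pip install modal
modal setup
modal run train_gpt.py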

Common Imports

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler
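# note: newer PyTorch also exposes these under torch.amp (torch.amp.autocast / torch.amp.GradScaler);
# GradScaler is mainly needed for float16, so plain bfloat16 training usually runs without it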
from dataclasses import dataclass
from einops import rearrange, repeat, reduce
import numpy as np
import math

When to Use What

Scenario                Approach
Standard GPT training   Use baseline model with standard residuals
Stability experiments   Try alternative residual variants or extra streams
Small experiments       Use T4/A10G GPU
Full training           Use A100 with bfloat16
Custom data             Modify the dataset loader class
Different model size    Adjust GPTConfig parameters (example below)
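
For the "different model size" row, a sketch reusing the GPTConfig dataclass from the minimal example; the values below are illustrative starting points, not tuned settings.

small = GPTConfig(block_size=512, n_layer=6, n_head=6, n_embd=384)       # small config for quick experiments
medium = GPTConfig(block_size=1024, n_layer=24, n_head=16, n_embd=1024)  # roughly GPT-2 medium scale (~350M params)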

Metrics to Monitor

Metric              Typical Signal            Notes
Validation loss     Steady decrease           Absolute value depends on dataset/tokenizer
Grad norm           Moderate, stable range    Large spikes indicate instability (see the step sketch below)
Training stability  Smooth curves             Frequent spikes suggest LR/batch issues
Throughput          Consistent tokens/sec     Use for comparing configs
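
A sketch of one bfloat16 training step that produces these signals; model, optimizer, and a batch function are assumed to already exist (for example, from the loading sketch in the Overview), and the logging is deliberately minimal.

import time
import torch
import torch.nn.functional as F

def train_step(model, optimizer, batch_fn, batch_size=8, block_size=1024, grad_clip=1.0):
    t0 = time.time()
    x, y = batch_fn(batch_size, block_size)    # assumed to return (inputs, targets) already on the GPU
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)                      # assumed shape: (batch, block, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()
    # clip_grad_norm_ returns the pre-clipping total norm, which is the value worth logging
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize()                   # so the timing reflects finished GPU work
    tokens_per_sec = batch_size * block_size / (time.time() - t0)
    return loss.item(), grad_norm.item(), tokens_per_sec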

External Resources
