expflow Pipeline HPO

Workflows

PDEBench competition workflow orchestration with expflow — three pipeline modes (full/fast/skip), distributed HPO, pruner integration, and ClearML HyperParameterOptimizer native mode.

Install

openclaw skills install expflow-pipeline-hpo

expflow PDEBench Pipeline & HPO

Orchestrate experiment workflows for the AI4S PDE competition using expflow. Three modes for three competition phases.

Triggers

  • User says "run HPO", "submit pipeline", "distributed experiment"
  • User says "competition sprint" or "fast iterate"
  • User asks about automating the train→eval→submit loop
  • User mentions needing to find best hyperparams

Installation

pip install "expflow-pde[pipeline]"

Available Pipeline Modes

Three pipeline modes, each mapped to a CLI command:

Mode A — Full (HPO → Train → Eval)

For the exploration phase of a competition task. Optuna finds best params via distributed clearml-agent trials, trains with best, then evaluates.

expflow pipeline submit-full train_task1.py \
    --queue default \
    --trials 50 --parallel 4 \
    --eval-script eval_task1.py \
    --metric seg_total --direction maximize

Flags used:

  • --trials N: total HPO trials
  • --parallel M: max concurrent trials (use GPU node count)
  • --metric: objective metric name prefixed METRIC: in script stdout
  • --pruner hyperband|median|percentile: early-stop poor trials
  • --study-name: Optuna study name (auto if omitted; persists to SQLite)
  • --skip hpo --skip eval: run train only within full skeleton

Mode B — Fast (Train → Eval)

For the competition sprint phase. You already know best params. Skip HPO, run directly with fixed args.

expflow pipeline submit train_task1.py \
    --queue default \
    --train-param lr=0.001 --train-param epochs=80 \
    --eval-script eval_task1.py \
    --eval-param sub_step=5

Flags:

  • --skip eval: train-only (just submit checkpoint)
  • --train-param key=val: injected as --key=val to training script
  • --eval-param key=val: injected as --key=val to eval script

Mode C — Flexible Skip

Override step inclusion on either mode:

expflow pipeline submit-full train_task1.py \
    --skip hpo --skip eval          # = train only
expflow pipeline submit-full train_task1.py \
    --skip train --skip eval         # = HPO only

HPO: Three Execution Modes

HPO (expflow optuna run) has three backends:

ModeFlagDescriptionBest for
Local(default)subprocess serial on CPU≤20 trials, quick test
Distributed--distributedask/tell + clearml Task cloneMulti-GPU, custom control
Optimizer--optimizer -OClearml HyperParameterOptimizerProduction, 50-200+ trials

Key flags across all HPO modes:

  • --pruner hyperband|median|percentile|none: ASHA pruner saves ~40% GPU time
  • --metric <name>: reads METRIC:<name>=<value> from script stdout
  • --direction maximize|minimize
  • --timeout <min>: safety cutoff

Script Requirements

The training/eval script must:

  1. Accept hyperparams as --key=value CLI arguments
  2. Output METRIC:<name>=<value> to stdout for objective capture (local mode)
  3. Report clearml scalars for distributed/optimizer mode:
    Task.current_task().report_scalar("Score", "seg_total", value, iteration=epoch)
    

Pitfalls

  • Pruner needs trial.report() calls during training. If the script only reports at the end, the pruner has nothing to prune on. Call trial.report(val_loss, epoch) at least every 10 epochs.
  • HyperParameterOptimizer needs the metric name in Title/Series format. If your metric is seg_total, it becomes title=seg_total, series=seg_total. If your clearml report_scalar is report_scalar("Score", "seg_total", v), pass --metric Score/seg_total.
  • Clearml-agent must be running on GPU nodes before submitting. Verify with expflow clearml workers or check Web UI.
  • _collect_one_trial polls every 5s — waits up to 60min per trial. If trials are expected to run longer, increase timeout_minutes.

Architecture Reference

Key files in expflow_pde/:

  • hpo.py — 3-mode HPO runner (local/distributed/optimizer)
  • pipeline.py — ExperimentPipeline class (fast/full modes)
  • cli_pipeline.pypipeline submit + pipeline submit-full
  • cli_optuna.pyoptuna run with all three backends

Related

  • experiment-lifecycle-governance — PIN, metrics registry, compare-scores, competition rules audit
  • pde-experiment-hyperparameters — PDEBench-specific hyperparameter reference
  • multi-agent-distributed-experiment-workflow — Hermes → OpenCode → clearml