expflow Pipeline HPO

PDEBench competition workflow orchestration with expflow — three pipeline modes (full/fast/skip), distributed HPO, pruner integration, and ClearML HyperParameterOptimizer native mode.

diamond2nv (@diamond2nv)

Install

openclaw skills install @diamond2nv/expflow-pipeline-hpo

expflow PDEBench Pipeline & HPO

Orchestrate experiment workflows for the AI4S PDE competition using expflow. Three modes for three competition phases.

Triggers

User says "run HPO", "submit pipeline", "distributed experiment"
User says "competition sprint" or "fast iterate"
User asks about automating the train→eval→submit loop
User mentions needing to find best hyperparams

Installation

bash

pip install "expflow-pde[pipeline]"

Available Pipeline Modes

Three pipeline modes, each mapped to a CLI command:

Mode A — Full (HPO → Train → Eval)

Use this in the exploration phase of a competition task. Optuna searches for the best parameters across distributed clearml-agent trials; the pipeline then trains with the best parameters and evaluates the result.

bash

expflow pipeline submit-full train_task1.py \
    --queue default \
    --trials 50 --parallel 4 \
    --eval-script eval_task1.py \
    --metric seg_total --direction maximize

Flags:

--trials N: total number of HPO trials
--parallel M: maximum number of concurrent trials (set to the GPU node count)
--metric NAME: objective metric, read from lines of the form METRIC:NAME=VALUE in the script's stdout
--pruner hyperband|median|percentile: stop poorly performing trials early
--study-name NAME: Optuna study name (generated automatically if omitted; the study persists to SQLite)
--skip hpo --skip eval: run only the train step within the full pipeline skeleton

Mode B — Fast (Train → Eval)

Use this in the competition sprint phase, when the best parameters are already known: skip HPO and run training directly with fixed arguments.

bash

expflow pipeline submit train_task1.py \
    --queue default \
    --train-param lr=0.001 --train-param epochs=80 \
    --eval-script eval_task1.py \
    --eval-param sub_step=5

Flags:

--skip eval: train-only (just submit checkpoint)
--train-param key=val: injected as --key=val to training script
--eval-param key=val: injected as --key=val to eval script
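The key=val to --key=val injection described above can be pictured with a short Python sketch. The helper name params_to_cli_args is hypothetical and is not taken from expflow; it only illustrates the documented mapping, under the assumption that each parameter is passed through unchanged.

```python
def params_to_cli_args(params):
    """Turn ["lr=0.001", "epochs=80"] into ["--lr=0.001", "--epochs=80"].

    Hypothetical helper: illustrates the documented rule, not expflow's code.
    """
    args = []
    for item in params:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=val, got {item!r}")
        args.append(f"--{key}={value}")
    return args


# Values mirror the --train-param flags in the Mode B example.
train_args = params_to_cli_args(["lr=0.001", "epochs=80"])
print(train_args)
```

With this mapping, the training script only needs to accept ordinary --key=value arguments and does not need to know about the pipeline.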

Mode C — Flexible Skip

Use --skip to drop individual steps from either mode:

bash

expflow pipeline submit-full train_task1.py \
    --skip hpo --skip eval          # = train only
expflow pipeline submit-full train_task1.py \
    --skip train --skip eval         # = HPO only
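As a rough mental model of how --skip narrows a pipeline, the sketch below resolves the steps that remain after skipping. The step names come from the mode descriptions above; the PIPELINE_STEPS table, the resolve_steps function, and the choice to reject unknown step names are assumptions made for illustration, not expflow's actual implementation.

```python
# Step order per mode, as described in Modes A and B (assumed layout).
PIPELINE_STEPS = {
    "full": ("hpo", "train", "eval"),   # pipeline submit-full
    "fast": ("train", "eval"),          # pipeline submit
}


def resolve_steps(mode, skip=()):
    """Return the steps of `mode` that remain after removing `skip`."""
    steps = PIPELINE_STEPS[mode]
    unknown = set(skip) - set(steps)
    if unknown:
        # Assumption: skipping a step the mode does not have is an error.
        raise ValueError(f"cannot skip {sorted(unknown)} in mode {mode!r}")
    return [step for step in steps if step not in skip]


print(resolve_steps("full", ["hpo", "eval"]))    # train only
print(resolve_steps("full", ["train", "eval"]))  # HPO only
```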

HPO: Three Execution Modes

HPO (expflow optuna run) has three backends:

Mode	Flag	Description	Best for
Local	(default)	Serial subprocesses on CPU	≤20 trials, quick tests
Distributed	`--distributed`	Optuna ask/tell + cloned ClearML Tasks	Multi-GPU, custom control
Optimizer	`--optimizer` / `-O`	ClearML `HyperParameterOptimizer`	Production, 50-200+ trials

Key flags across all HPO modes:

--pruner hyperband|median|percentile|none: stops poor trials early; ASHA-style pruning (hyperband) saves roughly 40% of GPU time
--metric <name>: reads METRIC:<name>=<value> from script stdout
--direction maximize|minimize
--timeout <min>: safety cutoff, in minutes
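To make the METRIC:<name>=<value> convention concrete, here is a minimal Python sketch of how such lines could be read from captured stdout. The function parse_metric and the choice to keep the last reported value are assumptions for illustration; expflow's own parser may behave differently.

```python
import re

# One metric per line, e.g. "METRIC:seg_total=0.84".
METRIC_LINE = re.compile(r"^METRIC:(?P<name>[^=\s]+)=(?P<value>\S+)$")


def parse_metric(stdout, name):
    """Return the last numeric value reported for `name`, or None if absent."""
    result = None
    for line in stdout.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match and match.group("name") == name:
            try:
                result = float(match.group("value"))
            except ValueError:
                continue  # skip malformed values instead of failing the trial
    return result


captured = "epoch 1 done\nMETRIC:seg_total=0.71\nMETRIC:loss=0.30\nMETRIC:seg_total=0.84\n"
print(parse_metric(captured, "seg_total"))
```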

Script Requirements

The training/eval script must:

Accept hyperparams as --key=value CLI arguments
Output METRIC:<name>=<value> to stdout for objective capture (local mode)

Report ClearML scalars for the distributed and optimizer modes. Scalars are reported through the task's logger:

python

from clearml import Task

Task.current_task().get_logger().report_scalar("Score", "seg_total", value, iteration=epoch)
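Putting the requirements together, a training script might look like the sketch below. It is a toy under stated assumptions: the hyperparameter names lr and epochs echo the Mode B example, the training loop is a placeholder formula, and the report helper is hypothetical. ClearML reporting goes through Task.get_logger().report_scalar and is skipped when clearml is not installed or no task is active.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Toy training script")
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--epochs", type=int, default=3)
    # parse_known_args ignores flags this toy script does not define.
    args, _unknown = parser.parse_known_args(argv)
    return args


def report(name, value, epoch):
    """Print the stdout marker and, if a ClearML task is active, log a scalar."""
    print(f"METRIC:{name}={value}", flush=True)
    try:
        from clearml import Task
    except ImportError:
        return  # local mode only needs the stdout line
    task = Task.current_task()
    if task is not None:
        task.get_logger().report_scalar("Score", name, value, iteration=epoch)


def main(argv=None):
    args = parse_args(argv)
    score = 0.0
    for epoch in range(1, args.epochs + 1):
        # Placeholder for a real training epoch.
        score = 1.0 - 1.0 / (1.0 + args.lr * 1000 * epoch)
        report("seg_total", score, epoch)
    return score


if __name__ == "__main__":
    main()
```

Because the scalar is reported with title "Score" and series "seg_total", the matching optimizer setting would be --metric Score/seg_total.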

Pitfalls

The pruner needs trial.report() calls during training. If the script only reports at the end, the pruner has no intermediate values to act on. Call trial.report(val_loss, epoch) at least every 10 epochs.
HyperParameterOptimizer needs the metric name in Title/Series format. A bare name such as seg_total is interpreted as title=seg_total, series=seg_total. If the script reports report_scalar("Score", "seg_total", v), pass --metric Score/seg_total so that the title and series match.
clearml-agent must be running on the GPU nodes before you submit. Verify with expflow clearml workers or check the ClearML Web UI.
_collect_one_trial polls every 5 s and waits up to 60 min per trial. If trials are expected to run longer, increase timeout_minutes.
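The first pitfall is easiest to see in code. The sketch below shows an Optuna-style objective that reports an intermediate value at a fixed interval and stops when the pruner asks for it. The objective body, the report_every parameter, and the placeholder loss formula are assumptions for illustration; trial.report, trial.should_prune, and optuna.TrialPruned are Optuna's standard pruning calls. A fallback exception class is defined so the sketch still runs where Optuna is not installed.

```python
try:
    import optuna
    TrialPruned = optuna.TrialPruned
except ImportError:  # fallback so the sketch runs without Optuna
    optuna = None

    class TrialPruned(Exception):
        pass


def objective(trial, epochs=30, report_every=10):
    """Toy objective that gives the pruner intermediate values to act on."""
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    val_loss = 1.0
    for epoch in range(1, epochs + 1):
        # Placeholder for one real training epoch.
        val_loss = 1.0 / (1.0 + lr * 100 * epoch)
        if epoch % report_every == 0:
            trial.report(val_loss, epoch)   # intermediate value for the pruner
            if trial.should_prune():
                raise TrialPruned()         # stop this trial early
    return val_loss


if __name__ == "__main__" and optuna is not None:
    study = optuna.create_study(
        direction="minimize",
        pruner=optuna.pruners.HyperbandPruner(),
    )
    study.optimize(objective, n_trials=20)
    print(study.best_value)
```

If the reporting interval is larger than the number of epochs, the pruner never receives a value and every trial runs to completion.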

Architecture Reference

Key files in expflow_pde/:

hpo.py — 3-mode HPO runner (local/distributed/optimizer)
pipeline.py — ExperimentPipeline class (fast/full modes)
cli_pipeline.py — pipeline submit + pipeline submit-full
cli_optuna.py — optuna run with all three backends

Related skills

experiment-lifecycle-governance — PIN, metrics registry, compare-scores, competition rules audit
pde-experiment-hyperparameters — PDEBench-specific hyperparameter reference
multi-agent-distributed-experiment-workflow — Hermes → OpenCode → clearml
