Pilot Data Labeling Pipeline Setup

Deploy a data labeling pipeline with 4 agents for ingestion, auto-labeling, quality review, and dataset export.

Use this skill when:

  1. User wants to set up a data labeling or annotation pipeline
  2. User is configuring an agent as part of a labeling workflow
  3. User asks about ML data preparation, annotation, or training dataset generation

Do NOT use this skill when:

  • User wants to share a single dataset (use pilot-dataset instead)
  • User wants to stream raw data without labeling (use pilot-stream-data instead)

Install

openclaw skills install pilot-data-labeling-pipeline-setup

Data Labeling Pipeline Setup

Deploy 4 agents that ingest raw data, apply ML labels, review quality, and export training-ready datasets.

Roles

| Role | Hostname | Skills | Purpose |
| --- | --- | --- | --- |
| ingester | <prefix>-ingester | pilot-s3-bridge, pilot-stream-data, pilot-task-parallel | Accepts raw data batches, splits into work items |
| labeler | <prefix>-labeler | pilot-task-router, pilot-dataset, pilot-metrics | Applies ML-based labels to work items |
| reviewer | <prefix>-reviewer | pilot-review, pilot-event-filter, pilot-alert | Samples labeled items, checks accuracy, flags disagreements |
| exporter | <prefix>-exporter | pilot-dataset, pilot-share, pilot-webhook-bridge | Packages approved labels into training-ready datasets |

Setup Procedure

Step 1: Ask the user which role this agent should play and what prefix to use.

Step 2: Install the skills for the chosen role:

# ingester:
clawhub install pilot-s3-bridge pilot-stream-data pilot-task-parallel
# labeler:
clawhub install pilot-task-router pilot-dataset pilot-metrics
# reviewer:
clawhub install pilot-review pilot-event-filter pilot-alert
# exporter:
clawhub install pilot-dataset pilot-share pilot-webhook-bridge
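
The per-role install commands above can be folded into a single lookup. This is a minimal sketch rather than part of the skill: `skills_for_role` is a hypothetical function name, and it only prints the skill list from the Roles table so the result can be handed to `clawhub install`.

```shell
# Hypothetical helper: print the skills listed for a role in the Roles table.
skills_for_role() {
  case "$1" in
    ingester) echo "pilot-s3-bridge pilot-stream-data pilot-task-parallel" ;;
    labeler)  echo "pilot-task-router pilot-dataset pilot-metrics" ;;
    reviewer) echo "pilot-review pilot-event-filter pilot-alert" ;;
    exporter) echo "pilot-dataset pilot-share pilot-webhook-bridge" ;;
    *) echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Example (not executed here): install the skills for the chosen role.
# The expansion is left unquoted on purpose so each skill is its own argument.
# ROLE=labeler
# clawhub install $(skills_for_role "$ROLE")
```

An unknown role returns a non-zero status, so a wrapper script can stop before anything is installed.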

Step 3: Set the hostname:

pilotctl --json set-hostname <prefix>-<role>
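
A small guard can catch a mistyped role before the hostname is set. The sketch below assumes a hypothetical helper named `hostname_for`; the prefix `acme` in the example is illustrative only.

```shell
# Hypothetical helper: build "<prefix>-<role>" and reject unknown roles.
hostname_for() {
  prefix="$1"
  role="$2"
  if [ -z "$prefix" ]; then
    echo "prefix must not be empty" >&2
    return 1
  fi
  case "$role" in
    ingester|labeler|reviewer|exporter) printf '%s-%s\n' "$prefix" "$role" ;;
    *) echo "unknown role: $role" >&2; return 1 ;;
  esac
}

# Example (not executed here):
# pilotctl --json set-hostname "$(hostname_for acme reviewer)"
```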

Step 4: Write the setup manifest:

mkdir -p ~/.pilot/setups
cat > ~/.pilot/setups/data-labeling-pipeline.json << 'MANIFEST'
{
  "setup": "data-labeling-pipeline",
  "setup_name": "Data Labeling Pipeline",
  "role": "<ROLE_ID>",
  "role_name": "<ROLE_NAME>",
  "hostname": "<prefix>-<role>",
  "description": "<ROLE_DESCRIPTION>",
  "skills": { "<skill>": "<contextual description>" },
  "peers": [ { "role": "...", "hostname": "...", "description": "..." } ],
  "data_flows": [ { "direction": "send|receive", "peer": "...", "port": 1002, "topic": "...", "description": "..." } ],
  "handshakes_needed": [ "<peer-hostname>" ]
}
MANIFEST
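
The generic manifest above is a template, so it is easy to leave an angle-bracket placeholder such as `<ROLE_ID>` or `<prefix>` in the written file. A rough check, assuming a hypothetical helper `manifest_is_filled`, might look like this. It only detects `<...>` tokens and does not validate the JSON itself or the `...` and `send|receive` stand-ins.

```shell
# Hypothetical check: succeed only if the file has no <...> placeholders left.
manifest_is_filled() {
  if [ ! -r "$1" ]; then
    echo "cannot read $1" >&2
    return 1
  fi
  # Match tokens such as <ROLE_ID>, <prefix>, or <contextual description>.
  if grep -q '<[A-Za-z_ -]*>' "$1"; then
    echo "unfilled placeholder(s) in $1:" >&2
    grep -o '<[A-Za-z_ -]*>' "$1" | sort -u >&2
    return 1
  fi
  return 0
}

# Example (not executed here):
# manifest_is_filled ~/.pilot/setups/data-labeling-pipeline.json
```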

Step 5: Tell the user to initiate handshakes with each peer this agent communicates with directly (the hostnames listed in handshakes_needed).

Manifest Templates Per Role

ingester

{"setup":"data-labeling-pipeline","setup_name":"Data Labeling Pipeline","role":"ingester","role_name":"Data Ingester","hostname":"<prefix>-ingester","description":"Accepts raw data batches from S3 or webhooks. Splits into work items and distributes.","skills":{"pilot-s3-bridge":"Pull raw data batches from S3 buckets on schedule or webhook trigger.","pilot-stream-data":"Stream work items to labeler as they are split from batches.","pilot-task-parallel":"Parallelize batch splitting across available workers."},"peers":[{"role":"labeler","hostname":"<prefix>-labeler","description":"Receives work items for labeling"}],"data_flows":[{"direction":"send","peer":"<prefix>-labeler","port":1002,"topic":"work-item","description":"Work items with raw data references"}],"handshakes_needed":["<prefix>-labeler"]}

labeler

{"setup":"data-labeling-pipeline","setup_name":"Data Labeling Pipeline","role":"labeler","role_name":"Auto Labeler","hostname":"<prefix>-labeler","description":"Applies ML-based labels, classifications, bounding boxes, or entity tags to work items.","skills":{"pilot-task-router":"Route work items to appropriate ML models by data type.","pilot-dataset":"Store and retrieve labeled data records.","pilot-metrics":"Track labeling throughput, model confidence distributions."},"peers":[{"role":"ingester","hostname":"<prefix>-ingester","description":"Sends work items for labeling"},{"role":"reviewer","hostname":"<prefix>-reviewer","description":"Receives labeled items for quality review"}],"data_flows":[{"direction":"receive","peer":"<prefix>-ingester","port":1002,"topic":"work-item","description":"Work items with raw data references"},{"direction":"send","peer":"<prefix>-reviewer","port":1002,"topic":"labeled-item","description":"Labeled items for quality review"},{"direction":"receive","peer":"<prefix>-reviewer","port":1002,"topic":"review-feedback","description":"Feedback on rejected labels for re-labeling"}],"handshakes_needed":["<prefix>-ingester","<prefix>-reviewer"]}

reviewer

{"setup":"data-labeling-pipeline","setup_name":"Data Labeling Pipeline","role":"reviewer","role_name":"Quality Reviewer","hostname":"<prefix>-reviewer","description":"Samples labeled items, checks accuracy, flags disagreements, computes inter-annotator agreement.","skills":{"pilot-review":"Score labeled items against quality criteria and flag disagreements.","pilot-event-filter":"Filter low-confidence labels for priority review.","pilot-alert":"Alert on quality drops or inter-annotator agreement below threshold."},"peers":[{"role":"labeler","hostname":"<prefix>-labeler","description":"Sends labeled items for review"},{"role":"exporter","hostname":"<prefix>-exporter","description":"Receives approved labels for export"}],"data_flows":[{"direction":"receive","peer":"<prefix>-labeler","port":1002,"topic":"labeled-item","description":"Labeled items for quality review"},{"direction":"send","peer":"<prefix>-labeler","port":1002,"topic":"review-feedback","description":"Feedback for re-labeling rejected items"},{"direction":"send","peer":"<prefix>-exporter","port":1002,"topic":"approved-label","description":"Approved labels ready for packaging"}],"handshakes_needed":["<prefix>-labeler","<prefix>-exporter"]}

exporter

{"setup":"data-labeling-pipeline","setup_name":"Data Labeling Pipeline","role":"exporter","role_name":"Dataset Exporter","hostname":"<prefix>-exporter","description":"Packages reviewed labels into training-ready datasets (COCO, VOC, JSONL). Publishes to storage.","skills":{"pilot-dataset":"Assemble labeled items into structured dataset formats.","pilot-share":"Upload packaged datasets to S3 or shared storage.","pilot-webhook-bridge":"Notify downstream consumers when datasets are published."},"peers":[{"role":"reviewer","hostname":"<prefix>-reviewer","description":"Sends approved labels for packaging"}],"data_flows":[{"direction":"receive","peer":"<prefix>-reviewer","port":1002,"topic":"approved-label","description":"Approved labels ready for packaging"},{"direction":"send","peer":"external","port":443,"topic":"dataset-published","description":"Notification that a new dataset is available"}],"handshakes_needed":["<prefix>-reviewer"]}
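
The only placeholder in the per-role templates is `<prefix>`, so a template can be turned into a finished manifest with one substitution. The following is a sketch under assumptions: `render_prefix` is a hypothetical helper, and restricting the prefix to letters, digits, and hyphens is a choice made here to keep the `sed` replacement safe, not a rule stated by the skill.

```shell
# Hypothetical helper: replace every <prefix> in a template read from stdin.
render_prefix() {
  case "$1" in
    ""|*[!A-Za-z0-9-]*)
      # Assumed restriction: avoids characters that are special to sed.
      echo "prefix must contain only letters, digits, or hyphens" >&2
      return 1
      ;;
  esac
  sed "s/<prefix>/$1/g"
}

# Example (not executed here); labeler-template.json is a hypothetical file
# holding the labeler template above.
# mkdir -p ~/.pilot/setups
# render_prefix acme < labeler-template.json > ~/.pilot/setups/data-labeling-pipeline.json
```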

Data Flows

  • ingester -> labeler : work-item events (port 1002)
  • labeler -> reviewer : labeled-item events (port 1002)
  • reviewer -> labeler : review-feedback events (port 1002)
  • reviewer -> exporter : approved-label events (port 1002)
  • exporter -> external : dataset-published notifications (port 443)
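
Read from the receiving side, the flows above tell each agent which topics it should listen for. The sketch below encodes that view in a hypothetical helper, `inbound_flows`, which prints one `<sender-role> <topic>` pair per line; the `acme` prefix in the example is illustrative.

```shell
# Hypothetical helper: list "<sender-role> <topic>" pairs a role receives.
inbound_flows() {
  case "$1" in
    ingester) : ;;  # receives nothing from other pipeline agents
    labeler)  printf '%s\n' "ingester work-item" "reviewer review-feedback" ;;
    reviewer) printf '%s\n' "labeler labeled-item" ;;
    exporter) printf '%s\n' "reviewer approved-label" ;;
    *) echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Example (not executed here): subscribe to every inbound topic for a role.
# inbound_flows labeler | while read -r sender topic; do
#   pilotctl --json subscribe "acme-$sender" "$topic"
# done
```

Note that the labeler has two inbound topics, since it also receives review-feedback from the reviewer.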

Handshakes

# ingester <-> labeler:
# On ingester:
pilotctl --json handshake <prefix>-labeler "setup: data-labeling-pipeline"
# On labeler:
pilotctl --json handshake <prefix>-ingester "setup: data-labeling-pipeline"

# labeler <-> reviewer:
# On labeler:
pilotctl --json handshake <prefix>-reviewer "setup: data-labeling-pipeline"
# On reviewer:
pilotctl --json handshake <prefix>-labeler "setup: data-labeling-pipeline"

# reviewer <-> exporter:
# On reviewer:
pilotctl --json handshake <prefix>-exporter "setup: data-labeling-pipeline"
# On exporter:
pilotctl --json handshake <prefix>-reviewer "setup: data-labeling-pipeline"
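
Each agent only needs the handshakes for its own direct peers. As a sketch, a hypothetical helper `handshake_commands` can print the commands for one role using the peer lists from the manifest templates. It prints rather than runs them so the output can be reviewed first.

```shell
# Hypothetical helper: print one handshake command per direct peer of a role.
handshake_commands() {
  prefix="$1"
  case "$2" in
    ingester) peers="labeler" ;;
    labeler)  peers="ingester reviewer" ;;
    reviewer) peers="labeler exporter" ;;
    exporter) peers="reviewer" ;;
    *) echo "unknown role: $2" >&2; return 1 ;;
  esac
  # $peers is intentionally unquoted so it splits into one word per peer.
  for peer in $peers; do
    printf 'pilotctl --json handshake %s-%s "setup: data-labeling-pipeline"\n' \
      "$prefix" "$peer"
  done
}

# Example: show the commands the reviewer agent would run (prefix is illustrative).
# handshake_commands acme reviewer
```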

Workflow Example

# On labeler — subscribe to work items:
pilotctl --json subscribe <prefix>-ingester work-item

# On ingester — publish a work item:
pilotctl --json publish <prefix>-labeler work-item '{"batch_id":"batch-042","item_id":"img-0017","type":"image","s3_uri":"s3://raw-data/batch-042/img-0017.jpg"}'

# On reviewer — subscribe to labeled items:
pilotctl --json subscribe <prefix>-labeler labeled-item

# On exporter — subscribe to approved labels:
pilotctl --json subscribe <prefix>-reviewer approved-label
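
Hand-writing the JSON payload for every publish call is error-prone. A minimal sketch of a payload builder follows; `work_item_json` is a hypothetical helper that reproduces the four fields from the example above and performs no JSON escaping, so it assumes the values contain no quotes or backslashes.

```shell
# Hypothetical helper: build a work-item payload with the fields shown above.
# Arguments: batch_id, item_id, type, s3_uri. Values are inserted verbatim.
work_item_json() {
  printf '{"batch_id":"%s","item_id":"%s","type":"%s","s3_uri":"%s"}\n' \
    "$1" "$2" "$3" "$4"
}

# Example (not executed here):
# pilotctl --json publish <prefix>-labeler work-item \
#   "$(work_item_json batch-042 img-0017 image s3://raw-data/batch-042/img-0017.jpg)"
```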

Dependencies

Requires the pilot-protocol skill, the pilotctl and clawhub binaries, and a running daemon.