Install
openclaw skills install data-construction-skill

Build concept, process, and case-application supervision datasets from markdown books or long markdown documents. Use when generating training data from many markdown sources.
Books are knowledge sources only. The final dataset must teach reusable domain knowledge and how to apply it. Do not generate book-comprehension questions, citation-led questions, or document-structure questions.
Default behavior is full coverage, not sampling.
Compile book knowledge into three complementary supervision forms:
- concept_qa: teach atomic reusable knowledge such as definitions, categories, rules, mechanisms, purposes, and constraints.
- process_qa: teach concise, grounded reasoning patterns such as condition checking, rule application, causal explanation, comparison, exception handling, and step ordering.
- case_application: teach knowledge transfer into realistic but source-grounded scenarios where the model must analyze a situation and apply the book's knowledge.

Use all three forms when supported. Do not force all three forms for every chunk.
Use either of these inputs:
- Raw markdown files.
- Existing *.chunks.jsonl files.

If chunk files already exist, reuse them instead of re-splitting the source books.
A task is complete only when every chunk has exactly one of the following outcomes:
- a kept record written to chunk_status.jsonl, or
- a skipped decision in chunk_status.jsonl with a non-empty skip_reason.

Do not stop after producing a small sample unless the user explicitly asks for a sample.
Do not report the task as completed, finished, done, or ready until all of the following are true:
- check_coverage.py reports unprocessed_chunks = 0
- sample_without_status_preview is empty
- sample_status_mismatch_preview is empty

Partial progress may be reported only as progress, never as completion.
Use a resumable work layout like this:
work/
  manifest.jsonl
  chunks/
    book_a.chunks.jsonl
  supervision_batches/
    batch_001.jsonl
    batch_002.jsonl
  chunk_status.jsonl
  supervision_merged.jsonl
  validation.json
  coverage.json
chunk_status.jsonl is required for full runs.
If the user provides markdown files, run:
scripts/build_manifest.py <input_dir> --output work/manifest.jsonl
scripts/split_markdown_book.py <input_md> --output work/chunks/<name>.chunks.jsonl --source-root <input_dir>
If chunk files already exist, skip this step.
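As an illustration of what chunk records can look like, here is a minimal heading-based splitter. This is a sketch only: the bundled split_markdown_book.py may use different splitting rules, chunk-id formats, and field names.

```python
import re

def split_markdown(text, source_file, min_level=2):
    """Split markdown text into chunk records at headings of `min_level`
    or shallower. Illustrative only; the real script may differ."""
    # Match lines starting with 1..min_level '#' characters followed by a space.
    heading = re.compile(rf"^(#{{1,{min_level}}})\s+", re.MULTILINE)
    starts = [m.start() for m in heading.finditer(text)] or [0]
    if starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    chunks = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        body = text[start:end].strip()
        if body:
            chunks.append({
                "chunk_id": f"{source_file}#{i:04d}",  # hypothetical id scheme
                "source_file": source_file,
                "text": body,
            })
    return chunks
```

Each returned record would then be serialized with json.dumps as one line of the .chunks.jsonl file.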
Process chunk files sequentially in small batches.
Recommended batch size: 20 to 50 chunks.
Use:
scripts/next_unprocessed_chunks.py work/chunks/*.chunks.jsonl --status work/chunk_status.jsonl --limit 25 --output work/next_batch.jsonl
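Internally, the selection reduces to set subtraction on chunk ids. A sketch of the logic, assuming each chunk and status record carries a "chunk_id" field (the real next_unprocessed_chunks.py may differ):

```python
import json

def next_unprocessed(chunk_paths, status_path, limit=25):
    """Return up to `limit` chunk records that have no status record yet."""
    done = set()
    try:
        with open(status_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    done.add(json.loads(line)["chunk_id"])
    except FileNotFoundError:
        pass  # first run: no status file yet, so nothing is processed

    batch = []
    for path in chunk_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                rec = json.loads(line)
                if rec["chunk_id"] not in done:
                    batch.append(rec)
                    if len(batch) >= limit:
                        return batch
    return batch
```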
For each chunk in the batch, do the following in order:
- If the chunk has no teachable knowledge, write a skipped record to chunk_status.jsonl.
- Otherwise, generate the supported sample types: concept_qa, process_qa, case_application.
- Write a kept status record for that chunk with per-type counts.

After finishing one batch, continue with the next unprocessed batch until no chunks remain.
After every processed batch, enforce these invariants:
- Never leave supervision records without status records.
- Never mark a chunk as kept unless at least one supervision record was actually written for that chunk.
- Never leave a processed chunk without a status record.
- If a chunk contains no teachable knowledge, generate 0 samples and record a skip reason.
Use only these values for skip_reason:
Do not invent new skip labels.
Before drafting samples, identify the knowledge taught by the chunk.
A knowledge proposition is a distinct reusable statement the model should learn.
Typical proposition types:
If the chunk does not support clear propositions, skip it.
For each chunk kept for supervision, identify any proposition relations that are explicitly supported or can be derived in one grounded step from the chunk:
These relations determine whether the chunk can support process or case supervision. Do not fabricate relations not supported by the source.
Transform source statements into concept-level knowledge.
Remove:
The samples must ask about the concept or application itself, not the document.
Generate:
- concept_qa when the chunk supports atomic knowledge such as definitions, categories, rules, mechanisms, purposes, or constraints.
- process_qa when the chunk supports concise grounded reasoning such as condition checking, rule application, comparison, or exception handling.
- case_application when the chunk supports scenario reframing grounded in the source.

Do not force process or case samples from chunks that only support atomic knowledge.
Do not impose a fixed upper limit on sample count per chunk.
The goal is to exhaust the chunk’s reusable knowledge propositions and supported reasoning patterns.
If a chunk teaches five distinct reusable propositions, generate supervision for all five.
If a chunk teaches ten distinct reusable propositions, generate supervision for all ten.
Do not stop early just because the chunk already has “enough” items.
However, exhaustiveness means exhausting distinct knowledge and reasoning patterns, not generating paraphrase variants.
Generate all distinct, supportable, reusable propositions and applications in the chunk, but do not ask multiple questions that test the same proposition with only wording changes.
Prefer proposition coverage over superficial sample count.
Exhaustiveness includes:
Exhaustiveness does not include:
Default to generating supervision from the current chunk alone.
If adjacent chunks belong to the same concept and one chunk alone is insufficient for a clean conceptual or process sample, generate a sample anchored to the primary chunk and optionally record supporting chunk ids in metadata.
Do not merge distant chunks or broad chapter themes into one item.
Skip the chunk instead of generating supervision when:
Exhaustive coverage does not justify weak samples.
Reasoning in this dataset is external supervision, not hidden chain-of-thought.
Use short, explicit, domain-grounded reasoning steps that teach a reusable decision pattern. Keep them concise and factual.
Good reasoning characteristics:
Bad reasoning characteristics:
concept_qa
Use when teaching reusable knowledge directly.
Question should stand alone.
Answer should:
process_qa
Use when teaching how to reason with the knowledge.
Question should require applying a source-supported rule, condition, comparison, sequence, or exception.
Reasoning should:
Answer should be brief and directly resolve the question.
case_application
Use when the chunk supports scenario transfer without hallucination.
Case should:
Analysis should:
Answer should resolve the case directly.
Avoid phrases such as:
Questions must stand alone.
Avoid questions framed around:
Ask about the concept instead.
Avoid answers or reasoning such as:
Answers and reasoning must provide knowledge, not instructions about answering.
Never generate samples from:
Never invent:
If the chunk does not support a clean grounded reasoning path, emit concept_qa only.
Use only these question_type values:
Use singular labels exactly as written above.
Write one JSON object per line.
Required fields for all sample types:
{
"sample_type": "concept_qa",
"source_file": "...",
"chunk_id": "..."
}
concept_qa
{
"sample_type": "concept_qa",
"question": "...",
"answer": "...",
"source_file": "...",
"chunk_id": "...",
"question_type": "definition",
"metadata": {
"knowledge_point": "...",
"supporting_chunk_ids": []
}
}
process_qa
{
"sample_type": "process_qa",
"question": "...",
"reasoning": [
"...",
"..."
],
"answer": "...",
"source_file": "...",
"chunk_id": "...",
"question_type": "rule",
"metadata": {
"knowledge_points": ["..."],
"reasoning_pattern": "rule_application",
"supporting_chunk_ids": []
}
}
case_application
{
"sample_type": "case_application",
"case": "...",
"question": "...",
"analysis": [
"...",
"..."
],
"answer": "...",
"source_file": "...",
"chunk_id": "...",
"question_type": "condition",
"metadata": {
"knowledge_points": ["..."],
"task_form": "case_analysis",
"supporting_chunk_ids": []
}
}
Use metadata.knowledge_point or metadata.knowledge_points when it helps identify the canonical concept being taught.
Use metadata.supporting_chunk_ids only when adjacent chunks are genuinely needed.
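These per-type schemas can be enforced mechanically. Below is a sketch of a per-line validator, assuming the field names exactly as shown above; the bundled validate_qa_jsonl.py may check more than this.

```python
import json

# Shared fields every sample must carry, plus per-type extras.
REQUIRED = {"sample_type", "question", "answer", "source_file", "chunk_id", "question_type"}
EXTRA = {
    "concept_qa": set(),
    "process_qa": {"reasoning"},
    "case_application": {"case", "analysis"},
}

def validate_sample(line):
    """Return a list of problems for one supervision JSONL line (empty = valid)."""
    try:
        rec = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    stype = rec.get("sample_type")
    if stype not in EXTRA:
        return [f"unknown sample_type: {stype!r}"]
    problems = []
    for field in REQUIRED | EXTRA[stype]:
        if not rec.get(field):
            problems.append(f"missing or empty field: {field}")
    # Reasoning and analysis must be lists of non-empty steps.
    for field in ("reasoning", "analysis"):
        if field in rec and not (isinstance(rec[field], list) and all(rec[field])):
            problems.append(f"{field} must be a list of non-empty steps")
    return problems
```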
Write one JSON object per line to chunk_status.jsonl.
For kept chunks:
{
"chunk_id": "...",
"source_file": "...",
"status": "kept",
"skip_reason": "",
"concept_count": 2,
"process_count": 1,
"case_count": 1,
"total_sample_count": 4
}
For skipped chunks:
{
"chunk_id": "...",
"source_file": "...",
"status": "skipped",
"skip_reason": "navigation",
"concept_count": 0,
"process_count": 0,
"case_count": 0,
"total_sample_count": 0
}
There must be exactly one final status record per processed chunk.
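The one-record-per-chunk rule, the kept-implies-samples rule, and count consistency can all be verified with a small check. A sketch, assuming the status fields shown above:

```python
import json
from collections import Counter

def check_status_file(status_path):
    """Verify chunk_status.jsonl: one record per chunk, consistent counts,
    kept chunks have samples, skipped chunks have a reason."""
    counts = Counter()
    inconsistent = []
    with open(status_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            counts[rec["chunk_id"]] += 1
            total = rec["concept_count"] + rec["process_count"] + rec["case_count"]
            if total != rec["total_sample_count"]:
                inconsistent.append(rec["chunk_id"])  # per-type counts must sum up
            if rec["status"] == "kept" and total == 0:
                inconsistent.append(rec["chunk_id"])  # kept chunks need >= 1 sample
            if rec["status"] == "skipped" and not rec.get("skip_reason"):
                inconsistent.append(rec["chunk_id"])  # skips need a non-empty reason
    duplicates = [cid for cid, n in counts.items() if n > 1]
    return {"duplicates": duplicates, "inconsistent": inconsistent}
```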
After each completed batch or after merging batches, run:
scripts/validate_qa_jsonl.py work/supervision_merged.jsonl --report work/validation.json
scripts/check_coverage.py work/chunks/*.chunks.jsonl --status work/chunk_status.jsonl --qa work/supervision_merged.jsonl --report work/coverage.json
If coverage reports any of the following, the run is not complete:
- unprocessed_chunks > 0
- sample_without_status_preview is non-empty
- sample_status_mismatch_preview is non-empty

If validation passes but coverage is incomplete, continue processing remaining chunks.
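The gate can be encoded directly as a check on coverage.json. This sketch assumes the report uses the same key names as the conditions above; check_coverage.py's actual schema may differ.

```python
import json

def run_is_complete(coverage_path):
    """Completion gate: True only when every chunk is processed and no
    supervision record lacks or mismatches a status record."""
    with open(coverage_path, encoding="utf-8") as f:
        cov = json.load(f)
    return (
        cov.get("unprocessed_chunks", 1) == 0
        and not cov.get("sample_without_status_preview")
        and not cov.get("sample_status_mismatch_preview")
    )
```

Only when this returns True may the run be reported as completed; anything else is partial progress.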
The final dataset should read like a general domain-supervision corpus, not a book comprehension exercise.
The objective is exhaustive coverage of reusable knowledge propositions across the corpus, plus grounded reasoning patterns and case application whenever the source supports them, with zero tolerance for structural leakage, fake reasoning, status inconsistency, or paraphrase-only duplication.