corporate-doc-builder

Generate polished .docx documents by injecting Markdown content into an existing Word template, preserving the template's cover page, TOC, fonts, headers, and footers. Use when the user has a .docx template plus reference materials (documents, spreadsheets, slides, or source code) and wants a production-ready Word deliverable. Covers the full pipeline from template analysis through chapter drafting to python-docx injection.

Install

openclaw skills install corporate-doc-builder

Corporate Doc Builder

Overview

Turning a corporate .docx template plus scattered source materials into a finished document is error-prone. Models routinely break template styles, exceed token limits, hallucinate TOC entries, and produce images that overlap with text.

This skill codifies a battle-tested 6-stage pipeline that avoids these traps:

1. Template Analysis  ->  Extract TOC, styles, placeholders
2. Spec               ->  Write a design spec for the document
3. Plan               ->  Break work into per-chapter tasks (optional for simple docs)
4. Research           ->  Summarize source materials into reusable notes
5. Authoring          ->  Write each chapter as an independent Markdown file
6. Injection          ->  Render diagrams + inject Markdown into the template

Core principle: Template styles are preserved via python-docx copy-and-inject, never via pandoc whole-file conversion. All diagrams use Mermaid.

When to Trigger

Activate this skill when ANY of the following apply:

  • The user asks to "write a document based on a template" or "generate a report from a template"
  • A task mentions both a .docx template path AND source/reference materials
  • The user requires "cover page / TOC / fonts / headers / footers must match the template"
  • The user asks to produce a corporate design document (outline design, detailed design, database design, interface design, architecture specification, technical white paper, etc.)
  • The task involves .docx template + .xlsx feature lists + .pptx architecture diagrams or similar mixed enterprise assets

Do NOT activate when:

  • The output is a plain Markdown, README, or blog post with no template
  • The user just wants to locally edit an existing .docx (use the docx skill instead)
  • No fixed template is involved

The 6-Stage Pipeline

Stage 1 begins with a pre-flight checklist, and stages end with exit criteria. Run the stages in order: do not skip any stage except the optional Stage 3, and do not run stages in parallel.


Stage 1: Template Analysis

Goal: Understand the template structure and agree on a TOC mapping with the user.

Companion skill: If superpowers:brainstorming is available, invoke it at the start of this stage to systematically explore user intent and requirements before committing to a TOC mapping.

Pre-flight Checklist

| Item | Action |
| --- | --- |
| Output directory | ls to verify it exists; fix typos before proceeding |
| Source material paths | ls each path to confirm accessibility |
| Template file | ls -la <template>.docx to confirm it exists and is not locked |
| Historical drafts | If prior output exists, read the first chapter to verify it belongs to THIS document |
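
The path checks in this checklist can also be scripted. The following is a minimal sketch, not part of the skill's scripts: the function names are hypothetical, and the lock check is only a heuristic that assumes Word has left its usual `~$...` owner file next to an open document.

```python
from pathlib import Path


def word_lock_exists(template: Path) -> bool:
    """Heuristic: look for a '~$...' owner file beside the template."""
    for sibling in template.parent.iterdir():
        name = sibling.name
        # Word's owner file is '~$' plus the file name (or its tail for long names)
        if name.startswith("~$") and len(name) > 2 and template.name.endswith(name[2:]):
            return True
    return False


def preflight(output_dir, template, sources):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if not Path(output_dir).is_dir():
        problems.append(f"output directory not found: {output_dir}")
    template = Path(template)
    if not template.is_file():
        problems.append(f"template not found: {template}")
    elif word_lock_exists(template):
        problems.append(f"template looks locked (open in Word?): {template}")
    for src in sources:
        if not Path(src).exists():
            problems.append(f"source path not found: {src}")
    return problems
```

Calling `preflight(output_dir, template_path, source_paths)` and stopping when the result is non-empty surfaces typos before any drafting starts.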

Extract Template TOC

from docx import Document

doc = Document(template_path)
for p in doc.paragraphs:
    name = p.style.name
    # TOC entry styles are named "toc 1", "toc 2", ...; the TOC title style is "TOC Heading"
    if name.startswith("Heading") or name.lower().startswith("toc"):
        print(name, p.text)

Some templates use custom styles (e.g., CJ1, CJ2) instead of standard Heading styles. Scan all paragraph styles and identify which ones act as headings.
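
One way to run that scan is to group paragraph text by style name and flag styles whose paragraphs all look like titles. This is a rough sketch with a hypothetical function name; the "short and no closing punctuation" test is an assumed heuristic, so treat the result as a list of candidates to confirm by eye.

```python
from collections import defaultdict


def heading_candidates(paragraphs, max_len=60):
    """Return (style_name, paragraph_count, first_text) for heading-like styles.

    `paragraphs` is any iterable of objects exposing `.style.name` and `.text`,
    which is what python-docx's `Document.paragraphs` provides.
    """
    texts_by_style = defaultdict(list)
    for p in paragraphs:
        text = p.text.strip()
        if text:  # ignore empty spacer paragraphs
            texts_by_style[p.style.name].append(text)

    candidates = []
    for name, texts in texts_by_style.items():
        all_short = all(len(t) <= max_len for t in texts)
        # Body text usually ends a sentence; titles usually do not
        ends_like_prose = any(t.endswith((".", "。", ";", ":")) for t in texts)
        if all_short and not ends_like_prose:
            candidates.append((name, len(texts), texts[0]))
    return sorted(candidates)
```

With python-docx installed, `heading_candidates(Document(template_path).paragraphs)` lists styles such as CJ1 or CJ2 alongside a sample of their text.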

Extract Template Images and Tables

Templates often contain placeholder images and tables. Extract them to plan which chapters need diagrams or data tables:

# Count inline images (floating/anchored pictures are not part of inline_shapes)
print(f"Images: {len(doc.inline_shapes)}")
# Count top-level tables in the document body
print(f"Tables: {len(doc.tables)}")

TOC Mapping Rules

Templates often instruct authors to "keep titles consistent." In practice, this means:

  • Top-level chapter titles (1, 2, 3, ...): Keep them exactly as the template defines.
  • Sub-section titles (1.1, 1.1.1, ...): Rewrite them to match the actual product/project. Do NOT copy the template's placeholder examples.
  • Style consistency: Match the template's tone (imperative verbs, clause-style statements, etc.)
  • Placeholder text: Replace ALL placeholder words (e.g., "XXX System", "Oracle Database", "SOA Architecture") with the actual technology stack and business domain.
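
A quick check for leftover placeholder words in a drafted TOC might look like the sketch below. The pattern list is an assumption built from the examples above; it has to be adapted per template, and a hit such as "Oracle Database" is only a problem when the project does not actually use that technology.

```python
import re

# Assumed placeholder patterns, taken from the examples in the rules above
PLACEHOLDER_PATTERNS = [r"X{3,}", r"Oracle Database", r"SOA Architecture"]


def find_placeholders(lines, patterns=PLACEHOLDER_PATTERNS):
    """Return (line_number, matched_text) for every placeholder hit, 1-indexed."""
    rx = re.compile("|".join(patterns))
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in rx.finditer(line):
            hits.append((number, match.group(0)))
    return hits
```

Running it over the proposed TOC lines before asking the user to confirm the mapping keeps template examples from slipping through.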

Exit Criteria

  • TOC mapping table reviewed and confirmed by the user
  • Work approach decided (write from scratch / reuse prior drafts / partial reuse)
  • Output directory, source material whitelist, and module scope are all explicit

Stage 2: Spec

Goal: Write a design spec that anchors all subsequent work.

Write to <output>/spec/<YYYY-MM-DD>-<topic>-spec.md. Include at minimum:

  1. Goal and scope
  2. Source material constraints (whitelist of allowed paths)
  3. Workflow overview
  4. Complete TOC (user-confirmed)
  5. Writing style baseline (language, depth, terminology)
  6. Token budget protection strategy
  7. Deliverables list
  8. Confirmed key decisions
  9. Open items

Self-check before submitting for review: scan for leftover placeholders, internal inconsistencies, scope creep, and ambiguity.

Exit Criteria

  • User has reviewed and approved the spec

Stage 3: Plan (Optional)

Goal: Break the work into per-chapter tasks for complex documents.

Skip this stage for simple documents (fewer than 5 chapters). For larger documents, write to <output>/plans/<YYYY-MM-DD>-<topic>-plan.md with tasks grouped into:

  • Research phase: 2-3 tasks (source code analysis, reference doc summary, feature mapping)
  • Authoring phase: One task per chapter
  • Injection phase: 2 tasks (Mermaid rendering, docx injection)

Each task should have bite-sized steps (2-5 minutes each). Per-chapter independent delivery + independent review is the key token budget protection mechanism.


Stage 4: Research

Goal: Extract and summarize source materials into reusable research notes.

Suggested output files in <output>/research/:

| File | Content |
| --- | --- |
| code-architecture.md | Top-level module structure, key packages, tech stack, critical data flows |
| reference-docs-summary.md | Heading outline + key table/figure index for each reference document |
| feature-mapping.md | Feature list (from xlsx/pptx) mapped to target TOC chapters |
| <topic>-inventory.md | Domain-specific inventory (e.g., interface list, data model list, API catalog) |

Summarization Principle

Reference documents are fact anchors, not content sources. Extract headings, table titles, and key data. Never copy full text into research notes.

Source Traceability Rule

Every TOC entry must have a traceable source (source code path, reference document section, or feature list row). If a TOC entry has no source, delete it from the TOC rather than drafting content without evidence.
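
The rule can be applied mechanically once each TOC entry carries its source annotations. A minimal sketch, assuming a hypothetical mapping from TOC title to a list of source references (the data shape is not prescribed by this skill):

```python
def split_by_evidence(toc_sources):
    """Split TOC entries into those with at least one source and those with none.

    `toc_sources` maps a TOC title to a list of source references
    (code paths, reference document sections, or feature list rows).
    """
    kept, dropped = {}, []
    for title, sources in toc_sources.items():
        # Blank strings do not count as evidence
        evidence = [s for s in sources if s and s.strip()]
        if evidence:
            kept[title] = evidence
        else:
            dropped.append(title)
    return kept, dropped
```

Entries in `dropped` are the ones to delete from the TOC, or to raise with the user, before authoring begins.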

Extraction Snippets

# Extract headings from .docx
from docx import Document
doc = Document(path)
for p in doc.paragraphs:
    if p.style.name.startswith("Heading"):
        print(p.style.name, p.text)

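# Extract structured data from .xlsx
import openpyxl
# data_only=True returns cached cell values instead of formula strings
wb = openpyxl.load_workbook(path, read_only=True, data_only=True)
for sh in wb.sheetnames:
    for row in wb[sh].iter_rows(values_only=True):
        print(row)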

# Bulk-extract embedded images from .docx
# unzip -j <path>.docx 'word/media/*' -d ./extracted_imgs/

Exit Criteria

  • All research notes delivered and reviewed by the user
  • Every TOC entry has a source annotation

Stage 5: Authoring

Goal: Write each chapter as an independent Markdown file.

File Layout

<output>/<doc>_md/
  ch01_<topic>.md
  ch02_<topic>.md
  ch03_<topic>_p1.md      # Split large chapters into parts
  ch03_<topic>_p2.md
  ...
  chNN_<topic>.md
  appendix_a.md
  full_draft.md            # Final concatenation

Why Per-Chapter Files

  • Keeps each request within token limits
  • Enables per-chapter user review; problems surface early
  • Rewriting one chapter does not affect others

Mermaid Diagrams

Companion skill: If claude-mermaid:mermaid-diagrams is available, invoke it before writing Mermaid blocks. It provides syntax best practices, diagram type selection, and live preview tools that produce significantly higher-quality diagrams.

  • Use fenced ```mermaid code blocks in Markdown
  • Do not embed image placeholders; actual images are generated during injection
  • Complex diagrams (deployment, sequence) should be individually numbered for easy replacement
  • Do not hardcode colors or themes in Mermaid source; handle theming during rendering
  • sequenceDiagram does NOT support style directives; avoid them

Merge

cat ch01_*.md ch02_*.md ... chNN_*.md appendix_*.md > full_draft.md

After merging, review once for: TOC continuity, chapter numbering consistency, and Mermaid block count.
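
Where shell globbing is inconvenient, the merge and the Mermaid block count can be done in one pass. The sketch below is an assumption-laden alternative to `cat`, not one of the skill's scripts: it sorts numerically so a part named _p10 lands after _p2, puts appendices last, and separates files with a blank line so a heading never fuses with the previous chapter's last line.

```python
import re
from pathlib import Path

FENCE = "`" * 3  # built programmatically so this example nests inside Markdown


def natural_key(path):
    """Sort chapters before appendices, comparing digit runs as numbers."""
    name = Path(path).name
    group = 0 if name.startswith("ch") else 1
    tokens = [int(t) if t.isdigit() else t for t in re.split(r"(\d+)", name)]
    return (group, tokens)


def merge_chapters(md_dir, out_name="full_draft.md"):
    """Write the merged draft; return (ordered file names, mermaid block count)."""
    md_dir = Path(md_dir)
    files = list(md_dir.glob("ch*.md")) + list(md_dir.glob("appendix_*.md"))
    files.sort(key=natural_key)
    parts = [f.read_text(encoding="utf-8").rstrip("\n") for f in files]
    merged = "\n\n".join(parts) + "\n"
    (md_dir / out_name).write_text(merged, encoding="utf-8")
    opening = re.compile(rf"^{FENCE}mermaid[ \t]*$", re.MULTILINE)
    return [f.name for f in files], len(opening.findall(merged))
```

The returned count can be compared with the number of diagrams planned in the spec as part of the post-merge review.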

Exit Criteria

  • All chapters reviewed and approved by the user
  • full_draft.md created with correct chapter order

Stage 6: Injection

Goal: Render Mermaid diagrams to PNG, then inject Markdown into the template to produce the final .docx.

Step 1: Render Mermaid to PNG

python scripts/render_mermaid.py <full_draft.md> <images_dir>

This extracts all ```mermaid blocks and renders each to diagram_1.png, diagram_2.png, etc.
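
For readers who want to see the extraction step in isolation, here is a rough sketch. It is not the contents of scripts/render_mermaid.py; the function names are hypothetical, and the command builder assumes the Mermaid CLI is reached through npx with its -i, -o, and -p (Puppeteer config) options.

```python
import re

FENCE = "`" * 3  # built programmatically so this example nests inside Markdown
MERMAID_BLOCK = re.compile(
    rf"^{FENCE}mermaid[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def extract_mermaid(markdown_text):
    """Return [(png_name, mermaid_source), ...] in document order."""
    bodies = MERMAID_BLOCK.findall(markdown_text)
    return [(f"diagram_{i}.png", body) for i, body in enumerate(bodies, start=1)]


def mmdc_command(mmd_path, png_path, puppeteer_config=None):
    """Build (but do not run) an mmdc invocation for one diagram."""
    cmd = ["npx", "-p", "@mermaid-js/mermaid-cli", "mmdc",
           "-i", str(mmd_path), "-o", str(png_path)]
    if puppeteer_config:
        cmd += ["-p", str(puppeteer_config)]
    return cmd
```

Because diagrams are numbered in document order, adding or removing a block shifts every later file name, which is one reason to re-render all diagrams after any edit to the draft.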

Step 2: Inject into Template

python scripts/inject_docx.py \
    --md-dir <markdown_dir> \
    --template <template.docx> \
    --output <output.docx> \
    --chapters ch01.md ch02.md ... appendix_a.md

The script: copies the template, clears body content after the TOC, injects Markdown as styled paragraphs, embeds Mermaid PNGs, and forces TOC field refresh.
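
This document does not say how the TOC refresh is forced. One widely used approach, shown here purely as an assumption about the technique, is to set the `w:updateFields` flag in the document's settings part so that Word offers to update fields when the file is opened. The sketch works on a standalone settings element with the standard library; with python-docx the same element would be added under `doc.settings.element` using that library's own XML helpers.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def request_field_update(settings_root):
    """Ensure <w:updateFields w:val="true"/> exists exactly once under w:settings."""
    tag = f"{{{W_NS}}}updateFields"
    element = settings_root.find(tag)
    if element is None:
        element = ET.SubElement(settings_root, tag)
    element.set(f"{{{W_NS}}}val", "true")
    return settings_root
```

The flag only asks Word to refresh on open; page numbers in the TOC stay stale until that happens, so the manual refresh check in the exit criteria still applies.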

Pre-Injection Template Style Audit

This is critical. Before running injection, check the template's paragraph styles for issues that will corrupt the output:

from docx import Document
doc = Document(template_path)
normal = doc.styles['Normal']
pf = normal.paragraph_format
print(f"Normal: line_spacing_rule={pf.line_spacing_rule}, line_spacing={pf.line_spacing}")
for style in doc.styles:
    if style.name and style.name.startswith("Heading"):
        pPr = style.element.find('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}pPr')
        if pPr is not None:
            numPr = pPr.find('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}numPr')
            if numPr is not None:
                print(f"WARNING: {style.name} has numPr (auto-numbering)")

Check for these three issues and apply fixes:

| Issue | Symptom | Fix |
| --- | --- | --- |
| Normal style has line_spacing_rule = EXACTLY | Images are clipped to line height and overlap with text | Override image paragraphs with line_spacing_rule = SINGLE |
| Heading styles have numPr elements | Double numbering: "1.1.1 2.1.1 Title" | Strip numPr from all Heading styles before injection |
| Non-Mermaid code blocks ignored | JSON/SQL/pseudocode blocks are blank in .docx | Render code blocks as shaded monospace paragraphs |

See the Template Style Pitfalls section for details.

Exit Criteria

  • .docx opens correctly in Word/LibreOffice
  • Cover page, TOC, headers, footers match the template
  • All images display correctly with no text overlap
  • All code blocks are rendered as monospace shaded paragraphs
  • TOC updates correctly when refreshed (Ctrl+A, F9 in Word)

Template Style Pitfalls

These issues were discovered across 4 production document generations. They are likely to affect any .docx template injection workflow, not just the templates in which they were first seen.

Pitfall 1: Image Clipping from Fixed Line Spacing

Root cause: Many corporate templates set the Normal paragraph style to line_spacing_rule = EXACTLY with a fixed height (e.g., 26pt). When an image is inserted into a paragraph inheriting this style, the paragraph height is locked to 26pt regardless of image size. The image overflows and overlaps subsequent text.

Fix: Explicitly set line_spacing_rule = SINGLE on every image paragraph:

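from docx.enum.text import WD_ALIGN_PARAGRAPH, WD_LINE_SPACING
from docx.shared import Pt, Cm

def add_image(doc, img_path, max_w_cm=14.0, max_h_cm=12.0):
    """Append a centered picture paragraph that ignores the template's fixed line height."""
    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    pf = p.paragraph_format
    pf.space_before = Pt(6)
    pf.space_after = Pt(6)
    pf.line_spacing_rule = WD_LINE_SPACING.SINGLE  # Override EXACTLY
    run = p.add_run()
    # image_size_cm is a helper not shown here; it is expected to return
    # (width_cm, height_cm) scaled to fit within max_w_cm x max_h_cm
    w_cm, h_cm = image_size_cm(img_path, max_w_cm, max_h_cm)
    run.add_picture(img_path, width=Cm(w_cm), height=Cm(h_cm))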

Also cap max_h_cm at 12 (not 18) to prevent a single image from filling the entire page.
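
The image_size_cm helper called by add_image is not shown in this document. The arithmetic it presumably performs can be sketched as follows; the function name, the pixel-based signature, and the 96 dpi default are all assumptions, and a real helper would first read the pixel size from the file (for example with Pillow).

```python
CM_PER_INCH = 2.54


def fit_within_cm(width_px, height_px, dpi=96, max_w_cm=14.0, max_h_cm=12.0):
    """Scale an image to fit the box, keeping aspect ratio and never upscaling."""
    w_cm = width_px / dpi * CM_PER_INCH
    h_cm = height_px / dpi * CM_PER_INCH
    # The tighter of the two limits decides the scale; 1.0 means "leave as is"
    scale = min(1.0, max_w_cm / w_cm, max_h_cm / h_cm)
    return w_cm * scale, h_cm * scale
```

A wide 1200 x 600 pixel diagram is limited by the width cap, while a tall 600 x 1200 pixel one is limited by the 12 cm height cap, which is the case the cap exists for.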

Pitfall 2: Double Numbering from Heading numPr

Root cause: Some templates configure Heading styles with numPr (automatic numbering at the style level). When the Markdown heading text already contains manual numbering (e.g., "2.1.1 System Architecture"), the output shows "1.1.1 2.1.1 System Architecture" - the style's auto-number prepended to the manual number.

Fix: Strip numPr from all Heading styles before injecting content:

WNS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def strip_heading_auto_numbering(doc):
    for style in doc.styles:
        if style.name and style.name.startswith("Heading"):
            pPr = style.element.find(f'{WNS}pPr')
            if pPr is not None:
                numPr = pPr.find(f'{WNS}numPr')
                if numPr is not None:
                    pPr.remove(numPr)

Pitfall 3: Missing Code Blocks

Root cause: Injection scripts that only handle Mermaid fenced blocks often skip other code blocks (JSON, SQL, pseudocode, curl examples), leaving blank spaces in the output.

Fix: Collect non-Mermaid code block lines and render them as shaded monospace paragraphs:

from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx.shared import Pt, RGBColor

def add_code_block(doc, lines):
    """Append one shaded, monospace paragraph holding the whole code block."""
    code_text = "\n".join(lines)
    p = doc.add_paragraph()
    pPr = p._element.get_or_add_pPr()
    shd = OxmlElement('w:shd')  # paragraph shading element
    shd.set(qn('w:val'), 'clear')
    shd.set(qn('w:color'), 'auto')
    shd.set(qn('w:fill'), 'F2F2F2')  # light gray background
    pPr.append(shd)
    run = p.add_run(code_text)  # python-docx turns each "\n" into a line break
    run.font.name = "Consolas"
    run.font.size = Pt(9)
    run.font.color.rgb = RGBColor(0x33, 0x33, 0x33)
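
Collecting the non-Mermaid code block lines is the step that tends to get skipped. A minimal line scanner for it could look like the sketch below; the function name and the segment tuple shape are hypothetical, and it assumes plain triple-backtick fences with no nesting.

```python
FENCE = "`" * 3  # built programmatically so this example nests inside Markdown


def split_fenced_blocks(markdown_text):
    """Return [(kind, lang, lines)], where kind is 'text', 'mermaid', or 'code'."""
    segments = []
    buf, lang, in_fence = [], "", False
    for line in markdown_text.splitlines():
        if line.strip().startswith(FENCE):
            if in_fence:
                # Closing fence: emit the block under its own kind
                kind = "mermaid" if lang == "mermaid" else "code"
                segments.append((kind, lang, buf))
                buf, lang, in_fence = [], "", False
            else:
                # Opening fence: flush pending prose, remember the language tag
                if buf:
                    segments.append(("text", "", buf))
                lang = line.strip()[len(FENCE):].strip().lower()
                buf, in_fence = [], True
        else:
            buf.append(line)
    if buf:
        # Trailing prose, or an unterminated fence kept as whatever it was
        kind = ("mermaid" if lang == "mermaid" else "code") if in_fence else "text"
        segments.append((kind, lang, buf))
    return segments
```

Each 'code' segment's lines can be handed to a function like add_code_block, while 'mermaid' segments are swapped for their rendered PNGs.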

Pitfall 4: Template Placeholder Text Leaking

Root cause: Template headers, footers, and cover pages contain placeholder text ("XXX Project", "XXXX System"). If not replaced, the output ships with the wrong project name.

Fix: Scan and replace header/footer text during the injection step:

for section in doc.sections:
    # Cover both the header and the footer of every section
    for part in (section.header, section.footer):
        for para in part.paragraphs:
            for run in para.runs:
                if "XXX" in run.text:
                    run.text = run.text.replace("XXX", actual_project_name)
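
Run-by-run replacement misses a placeholder that Word has split across several runs (for example "XX" in one run and "X Project" in the next), which is common after manual editing. A hedged sketch of a paragraph-level fallback, using a hypothetical function name, follows. It accepts any object with a `.runs` list whose items have a writable `.text`, which matches python-docx paragraphs.

```python
def replace_in_paragraph(paragraph, old, new):
    """Replace `old` with `new`; return True if the paragraph was changed."""
    runs = paragraph.runs
    if any(old in run.text for run in runs):
        # Simple case: the placeholder sits inside individual runs
        for run in runs:
            run.text = run.text.replace(old, new)
        return True
    full_text = "".join(run.text for run in runs)
    if old not in full_text:
        return False
    # Placeholder straddles runs: keep the first run's formatting, empty the rest
    runs[0].text = full_text.replace(old, new)
    for run in runs[1:]:
        run.text = ""
    return True
```

The fallback trades fidelity for coverage: the merged text takes the first run's formatting, and emptying later runs can discard non-text content in them such as page-number fields, so it is best limited to paragraphs known to hold only the placeholder text.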

Token Budget Protection

LLM context windows have hard limits. These strategies prevent token overflow during document generation:

| Strategy | Stage |
| --- | --- |
| Per-chapter independent Markdown files | Authoring |
| Research notes are summaries, not full-text copies | Research |
| Read source code on demand (ls + Read), never dump entire directories | Research |
| Compress long lists into tables | All stages |
| Per-chapter review checkpoints | Authoring |
| Never load all chapters into a single request | Authoring / Injection |
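
To decide which chapters need splitting into _p1/_p2 parts, a coarse size check is often enough. The sketch below is illustrative only: the 4-characters-per-token ratio is a rough assumption for English prose (it is far off for CJK text or dense code), the function names are hypothetical, and a real tokenizer should replace the estimate when accuracy matters.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate based on character count."""
    return math.ceil(len(text) / chars_per_token)


def chapters_over_budget(chapters, budget_tokens, chars_per_token=4.0):
    """Return the names of chapters whose estimated size exceeds the budget.

    `chapters` maps a chapter file name to its Markdown text.
    """
    return sorted(
        name
        for name, text in chapters.items()
        if estimate_tokens(text, chars_per_token) > budget_tokens
    )
```

Any chapter it reports is a candidate for splitting before the next review checkpoint.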

Common Pitfalls

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| Path typos or directories do not exist | No pre-flight path validation | ls every path in the whitelist before starting |
| Wrong draft used as starting point | Did not verify which document a prior draft belongs to | Read the first chapter to confirm the topic |
| Template placeholder text appears in output | Treated "keep titles consistent" as "keep content identical" | Keep top-level titles; rewrite sub-sections for the actual project |
| Chapter organization does not match the product | Organized by code modules instead of user-facing capabilities | Organize by product capability, not by engineering repo structure |
| Token limit errors | Too many chapters loaded at once | Per-chapter files + summarized research |
| pandoc destroys template fonts/headers/footers | Used pandoc instead of python-docx | Always use python-docx template copy + injection |
| Reference doc text copied verbatim into chapters | Treated source material as content rather than fact anchors | Research phase produces summaries only |
| TOC entries have no source evidence | Concept-level headings imported without code/doc backing | Research phase: annotate every entry with a source; delete unsupported entries |
| "1.1.1 2.1.1 Title" double numbering | Template Heading styles have numPr auto-numbering | Strip numPr before injection |
| JSON/SQL/pseudocode blocks are blank in .docx | Injection script skips non-Mermaid code blocks | Render code blocks as shaded monospace paragraphs |
| Images overlap with text or are clipped | Template Normal style uses EXACTLY line spacing | Set image paragraph line_spacing_rule = SINGLE; cap max_h_cm at 12 |

Reference Implementation

This skill includes ready-to-use Python scripts in the scripts/ directory:

scripts/render_mermaid.py

Extracts all ```mermaid blocks from a Markdown file and renders each to diagram_N.png using mmdc (Mermaid CLI).

python scripts/render_mermaid.py <markdown_file> <output_image_dir>

Requirements: npx (Node.js), which auto-installs @mermaid-js/mermaid-cli.

scripts/inject_docx.py

Copies a .docx template, clears the body after the TOC, and injects Markdown content as properly styled Word elements.

python scripts/inject_docx.py \
    --md-dir ./output/chapters_md \
    --template ./templates/design_spec.docx \
    --output ./output/design_spec.docx \
    --chapters ch01.md ch02.md ch03.md appendix_a.md \
    --header-replace "XXX=My Project Name"

Features:

  • Heading injection (levels 1-3)
  • Markdown table to Word table conversion
  • Bold and inline code formatting
  • Mermaid PNG image embedding with correct sizing
  • Non-Mermaid code block rendering (shaded monospace)
  • Heading numPr auto-numbering removal
  • Image paragraph SINGLE line spacing (prevents clipping)
  • TOC field auto-refresh on open
  • Optional header/footer text replacement

Requirements: python-docx, Pillow.

scripts/puppeteer-config.json

Disables Chromium sandboxing for mmdc in Linux/container environments:

{ "args": ["--no-sandbox"] }

Companion Skills (Optional Enhancements)

This skill is fully self-contained — it works without any companion skills installed. However, if the following skills are available in your environment, they significantly improve specific stages:

| Skill | Stage | Benefit |
| --- | --- | --- |
| claude-mermaid:mermaid-diagrams | Stage 5 (Authoring) | Provides Mermaid syntax best practices, diagram type selection guidance, and live preview/save tools (mermaid_preview / mermaid_save). Produces higher-quality diagrams than writing Mermaid from scratch. |
| superpowers:brainstorming | Stage 1 (Template Analysis) | Structured brainstorming workflow that explores user intent, requirements, and design alternatives before committing to a TOC mapping. Reduces rework. |
| superpowers:writing-plans | Stage 3 (Plan) | Structured planning workflow for multi-step implementation tasks. Helps break complex documents into well-scoped per-chapter tasks. |

How to use them: If a companion skill is available, invoke it via the Skill tool at the relevant stage. If it is not available, follow the inline guidance in this skill — the core instructions for each stage already cover the essential techniques.

Example: During Stage 5, if claude-mermaid:mermaid-diagrams is installed, invoke it before writing Mermaid blocks. If not, follow the Mermaid guidelines in the Authoring section directly.


Pre-Flight Checklist

Use this checklist when starting any new document:

  • Verify output directory exists
  • Verify all source material paths are accessible
  • Verify template file exists and is not locked
  • Extract template TOC (headings + toc-styled paragraphs)
  • Extract template images and tables to plan per-chapter visuals
  • Audit template styles: check Normal line_spacing_rule and Heading numPr
  • Confirm TOC mapping with the user (top-level fixed, sub-sections adapted)
  • Write spec and get user approval
  • Complete research with source traceability for every TOC entry
  • Author each chapter as an independent Markdown file
  • Merge into full_draft.md and review
  • Render Mermaid diagrams to PNG
  • Run injection script
  • Open output .docx and verify: cover page, TOC refresh, image layout, code blocks