corporate-doc-builder

Dev Tools

Generate polished .docx documents by injecting Markdown content into an existing Word template, preserving the template's cover page, TOC, fonts, headers, and footers. Use when the user has a .docx template plus reference materials (documents, spreadsheets, slides, or source code) and wants a production-ready Word deliverable. Covers the full pipeline from template analysis through chapter drafting to python-docx injection.

Install

openclaw skills install corporate-doc-builder

Corporate Doc Builder

Overview

Turning a corporate .docx template plus scattered source materials into a finished document is error-prone. Models routinely break template styles, exceed token limits, hallucinate TOC entries, and produce images that overlap with text.

This skill codifies a battle-tested 6-stage pipeline that avoids these traps:

1. Template Analysis  ->  Extract TOC, styles, placeholders
2. Spec               ->  Write a design spec for the document
3. Plan               ->  Break work into per-chapter tasks (optional for simple docs)
4. Research           ->  Summarize source materials into reusable notes
5. Authoring          ->  Write each chapter as an independent Markdown file
6. Injection          ->  Render diagrams + inject Markdown into the template

Core principle: Template styles are preserved via python-docx copy-and-inject, never via pandoc whole-file conversion. All diagrams use Mermaid.

When to Trigger

Activate this skill when ANY of the following apply:

The user asks to "write a document based on a template" or "generate a report from a template"
A task mentions both a .docx template path AND source/reference materials
The user requires "cover page / TOC / fonts / headers / footers must match the template"
The user asks to produce a corporate design document (outline design, detailed design, database design, interface design, architecture specification, technical white paper, etc.)
The task involves .docx template + .xlsx feature lists + .pptx architecture diagrams or similar mixed enterprise assets

Do NOT activate when:

The output is a plain Markdown, README, or blog post with no template
The user just wants to locally edit an existing .docx (use the docx skill instead)
No fixed template is involved

The 6-Stage Pipeline

Each stage has a pre-flight checklist and exit criteria. Do not skip stages or run them in parallel.

Stage 1: Template Analysis

Goal: Understand the template structure and agree on a TOC mapping with the user.

Companion skill: If superpowers:brainstorming is available, invoke it at the start of this stage to systematically explore user intent and requirements before committing to a TOC mapping.

Pre-flight Checklist

Item	Action
Output directory	`ls` to verify it exists; fix typos before proceeding
Source material paths	`ls` each path to confirm accessibility
Template file	`ls -la <template>.docx` to confirm it exists and is not locked
Historical drafts	If prior output exists, read the first chapter to verify it belongs to THIS document

Extract Template TOC

from docx import Document
doc = Document(template_path)
for p in doc.paragraphs:
    if p.style.name.startswith("Heading") or p.style.name.startswith("toc"):
        print(p.style.name, p.text)

Some templates use custom styles (e.g., CJ1, CJ2) instead of standard Heading styles. Scan all paragraph styles and identify which ones act as headings.

Extract Template Images and Tables

Templates often contain placeholder images and tables. Extract them to plan which chapters need diagrams or data tables:

# Count images
print(f"Images: {len(doc.inline_shapes)}")
# Count tables
print(f"Tables: {len(doc.tables)}")

TOC Mapping Rules

Templates often say "keep titles consistent." The real meaning is:

Top-level chapter titles (1, 2, 3, ...): Keep them exactly as the template defines.
Sub-section titles (1.1, 1.1.1, ...): Rewrite them to match the actual product/project. Do NOT copy the template's placeholder examples.
Style consistency: Match the template's tone (imperative verbs, clause-style statements, etc.)
Placeholder text: Replace ALL placeholder words (e.g., "XXX System", "Oracle Database", "SOA Architecture") with the actual technology stack and business domain.

Exit Criteria

TOC mapping table reviewed and confirmed by the user
Work approach decided (write from scratch / reuse prior drafts / partial reuse)
Output directory, source material whitelist, and module scope are all explicit

Stage 2: Spec

Goal: Write a design spec that anchors all subsequent work.

Write to <output>/spec/<YYYY-MM-DD>-<topic>-spec.md. Include at minimum:

Goal and scope
Source material constraints (whitelist of allowed paths)
Workflow overview
Complete TOC (user-confirmed)
Writing style baseline (language, depth, terminology)
Token budget protection strategy
Deliverables list
Confirmed key decisions
Open items

Self-check before submitting for review: scan for leftover placeholders, internal inconsistencies, scope creep, and ambiguity.

Exit Criteria

User has reviewed and approved the spec

Stage 3: Plan (Optional)

Goal: Break the work into per-chapter tasks for complex documents.

Skip this stage for simple documents (fewer than 5 chapters). For larger documents, write to <output>/plans/<YYYY-MM-DD>-<topic>-plan.md with tasks grouped into:

Research phase: 2-3 tasks (source code analysis, reference doc summary, feature mapping)
Authoring phase: One task per chapter
Injection phase: 2 tasks (Mermaid rendering, docx injection)

Each task should have bite-sized steps (2-5 minutes each). Per-chapter independent delivery + independent review is the key token budget protection mechanism.

Stage 4: Research

Goal: Extract and summarize source materials into reusable research notes.

Suggested output files in <output>/research/:

File	Content
`code-architecture.md`	Top-level module structure, key packages, tech stack, critical data flows
`reference-docs-summary.md`	Heading outline + key table/figure index for each reference document
`feature-mapping.md`	Feature list (from xlsx/pptx) mapped to target TOC chapters
`<topic>-inventory.md`	Domain-specific inventory (e.g., interface list, data model list, API catalog)

Summarization Principle

Reference documents are fact anchors, not content sources. Extract headings, table titles, and key data. Never copy full text into research notes.

Source Traceability Rule

Every TOC entry must have a traceable source (source code path, reference document section, or feature list row). If a TOC entry has no source, delete it from the TOC rather than drafting content without evidence.

Extraction Snippets

# Extract headings from .docx
from docx import Document
doc = Document(path)
for p in doc.paragraphs:
    if p.style.name.startswith("Heading"):
        print(p.style.name, p.text)

# Extract structured data from .xlsx
import openpyxl
wb = openpyxl.load_workbook(path)
for sh in wb.sheetnames:
    for row in wb[sh].iter_rows(values_only=True):
        print(row)

# Bulk-extract embedded images from .docx
# unzip -j <path>.docx 'word/media/*' -d ./extracted_imgs/

Exit Criteria

All research notes delivered and reviewed by the user
Every TOC entry has a source annotation

Stage 5: Authoring

Goal: Write each chapter as an independent Markdown file.

File Layout

<output>/<doc>_md/
  ch01_<topic>.md
  ch02_<topic>.md
  ch03_<topic>_p1.md      # Split large chapters into parts
  ch03_<topic>_p2.md
  ...
  chNN_<topic>.md
  appendix_a.md
  full_draft.md            # Final concatenation

Why Per-Chapter Files

Keeps each request within token limits
Enables per-chapter user review; problems surface early
Rewriting one chapter does not affect others

Mermaid Diagrams

Companion skill: If claude-mermaid:mermaid-diagrams is available, invoke it before writing Mermaid blocks. It provides syntax best practices, diagram type selection, and live preview tools that produce significantly higher-quality diagrams.

Use fenced ```mermaid code blocks in Markdown
Do not embed image placeholders; actual images are generated during injection
Complex diagrams (deployment, sequence) should be individually numbered for easy replacement
Do not hardcode colors or themes in Mermaid source; handle theming during rendering
sequenceDiagram does NOT support style directives; avoid them

Merge

cat ch01_*.md ch02_*.md ... chNN_*.md appendix_*.md > full_draft.md

After merging, review once for: TOC continuity, chapter numbering consistency, and Mermaid block count.

Exit Criteria

All chapters reviewed and approved by the user
full_draft.md created with correct chapter order

Stage 6: Injection

Goal: Render Mermaid diagrams to PNG, then inject Markdown into the template to produce the final .docx.

Step 1: Render Mermaid to PNG

python scripts/render_mermaid.py <full_draft.md> <images_dir>

This extracts all ```mermaid blocks and renders each to diagram_1.png, diagram_2.png, etc.

Step 2: Inject into Template

python scripts/inject_docx.py \
    --md-dir <markdown_dir> \
    --template <template.docx> \
    --output <output.docx> \
    --chapters ch01.md ch02.md ... appendix_a.md

The script: copies the template, clears body content after the TOC, injects Markdown as styled paragraphs, embeds Mermaid PNGs, and forces TOC field refresh.

Pre-Injection Template Style Audit

This is critical. Before running injection, check the template's paragraph styles for issues that will corrupt the output:

from docx import Document
doc = Document(template_path)
normal = doc.styles['Normal']
pf = normal.paragraph_format
print(f"Normal: line_spacing_rule={pf.line_spacing_rule}, line_spacing={pf.line_spacing}")
for style in doc.styles:
    if style.name and style.name.startswith("Heading"):
        pPr = style.element.find('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}pPr')
        if pPr is not None:
            numPr = pPr.find('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}numPr')
            if numPr is not None:
                print(f"WARNING: {style.name} has numPr (auto-numbering)")

Check for these three issues and apply fixes:

Issue	Symptom	Fix
Normal style has `line_spacing_rule = EXACTLY`	Images are clipped to line height and overlap with text	Override image paragraphs with `line_spacing_rule = SINGLE`
Heading styles have `numPr` elements	Double numbering: "1.1.1 2.1.1 Title"	Strip `numPr` from all Heading styles before injection
Non-Mermaid code blocks ignored	JSON/SQL/pseudocode blocks are blank in .docx	Render code blocks as shaded monospace paragraphs

See the Template Style Pitfalls section for details.

Exit Criteria

.docx opens correctly in Word/LibreOffice
Cover page, TOC, headers, footers match the template
All images display correctly with no text overlap
All code blocks are rendered as monospace shaded paragraphs
TOC updates correctly when refreshed (Ctrl+A, F9 in Word)

Template Style Pitfalls

These issues were discovered across 4 production document generations. They are universal to any .docx template injection workflow.

Pitfall 1: Image Clipping from Fixed Line Spacing

Root cause: Many corporate templates set the Normal paragraph style to line_spacing_rule = EXACTLY with a fixed height (e.g., 26pt). When an image is inserted into a paragraph inheriting this style, the paragraph height is locked to 26pt regardless of image size. The image overflows and overlaps subsequent text.

Fix: Explicitly set line_spacing_rule = SINGLE on every image paragraph:

from docx.enum.text import WD_LINE_SPACING
from docx.shared import Pt, Cm

def add_image(doc, img_path, max_w_cm=14.0, max_h_cm=12.0):
    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    pf = p.paragraph_format
    pf.space_before = Pt(6)
    pf.space_after = Pt(6)
    pf.line_spacing_rule = WD_LINE_SPACING.SINGLE  # Override EXACTLY
    run = p.add_run()
    w_cm, h_cm = image_size_cm(img_path, max_w_cm, max_h_cm)
    run.add_picture(img_path, width=Cm(w_cm), height=Cm(h_cm))

Also cap max_h_cm at 12 (not 18) to prevent a single image from filling the entire page.

Pitfall 2: Double Numbering from Heading numPr

Root cause: Some templates configure Heading styles with numPr (automatic numbering at the style level). When the Markdown heading text already contains manual numbering (e.g., "2.1.1 System Architecture"), the output shows "1.1.1 2.1.1 System Architecture" - the style's auto-number prepended to the manual number.

Fix: Strip numPr from all Heading styles before injecting content:

WNS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def strip_heading_auto_numbering(doc):
    for style in doc.styles:
        if style.name and style.name.startswith("Heading"):
            pPr = style.element.find(f'{WNS}pPr')
            if pPr is not None:
                numPr = pPr.find(f'{WNS}numPr')
                if numPr is not None:
                    pPr.remove(numPr)

Pitfall 3: Missing Code Blocks

Root cause: Injection scripts that only handle Mermaid fenced blocks often skip other code blocks (JSON, SQL, pseudocode, curl examples), leaving blank spaces in the output.

Fix: Collect non-Mermaid code block lines and render them as shaded monospace paragraphs:

from docx.shared import Pt, RGBColor
from docx.oxml.ns import qn

def add_code_block(doc, lines):
    code_text = "\n".join(lines)
    p = doc.add_paragraph()
    pPr = p._element.get_or_add_pPr()
    shd = docx.oxml.OxmlElement('w:shd')
    shd.set(qn('w:val'), 'clear')
    shd.set(qn('w:color'), 'auto')
    shd.set(qn('w:fill'), 'F2F2F2')
    pPr.append(shd)
    run = p.add_run(code_text)
    run.font.name = "Consolas"
    run.font.size = Pt(9)
    run.font.color.rgb = RGBColor(0x33, 0x33, 0x33)

Pitfall 4: Template Placeholder Text Leaking

Root cause: Template headers, footers, and cover pages contain placeholder text ("XXX Project", "XXXX System"). If not replaced, the output ships with the wrong project name.

Fix: Scan and replace header/footer text during the injection step:

for section in doc.sections:
    for header_para in section.header.paragraphs:
        for run in header_para.runs:
            if "XXX" in run.text:
                run.text = run.text.replace("XXX", actual_project_name)

Token Budget Protection

LLM context windows have hard limits. These strategies prevent token overflow during document generation:

Strategy	Stage
Per-chapter independent Markdown files	Authoring
Research notes are summaries, not full-text copies	Research
Read source code on demand (`ls` + `Read`), never dump entire directories	Research
Compress long lists into tables	All stages
Per-chapter review checkpoints	Authoring
Never load all chapters into a single request	Authoring / Injection

Common Pitfalls

Symptom	Root Cause	Fix
Path typos or directories do not exist	No pre-flight path validation	`ls` every path in the whitelist before starting
Wrong draft used as starting point	Did not verify which document a prior draft belongs to	Read the first chapter to confirm the topic
Template placeholder text appears in output	Treated "keep titles consistent" as "keep content identical"	Keep top-level titles; rewrite sub-sections for the actual project
Chapter organization does not match the product	Organized by code modules instead of user-facing capabilities	Organize by product capability, not by engineering repo structure
Token limit errors	Too many chapters loaded at once	Per-chapter files + summarized research
pandoc destroys template fonts/headers/footers	Used pandoc instead of python-docx	Always use python-docx template copy + injection
Reference doc text copied verbatim into chapters	Treated source material as content rather than fact anchors	Research phase produces summaries only
TOC entries have no source evidence	Concept-level headings imported without code/doc backing	Research phase: annotate every entry with a source; delete unsupported entries
"1.1.1 2.1.1 Title" double numbering	Template Heading styles have `numPr` auto-numbering	Strip `numPr` before injection
JSON/SQL/pseudocode blocks are blank in .docx	Injection script skips non-Mermaid code blocks	Render code blocks as shaded monospace paragraphs
Images overlap with text or are clipped	Template Normal style uses `EXACTLY` line spacing	Set image paragraph `line_spacing_rule = SINGLE`; cap `max_h_cm` at 12

Reference Implementation

This skill includes ready-to-use Python scripts in the scripts/ directory:

`scripts/render_mermaid.py`

Extracts all ```mermaid blocks from a Markdown file and renders each to diagram_N.png using mmdc (Mermaid CLI).

python scripts/render_mermaid.py <markdown_file> <output_image_dir>

Requirements: npx (Node.js), which auto-installs @mermaid-js/mermaid-cli.

`scripts/inject_docx.py`

Copies a .docx template, clears the body after the TOC, and injects Markdown content as properly styled Word elements.

python scripts/inject_docx.py \
    --md-dir ./output/chapters_md \
    --template ./templates/design_spec.docx \
    --output ./output/design_spec.docx \
    --chapters ch01.md ch02.md ch03.md appendix_a.md \
    --header-replace "XXX=My Project Name"

Features:

Heading injection (levels 1-3)
Markdown table to Word table conversion
Bold and inline code formatting
Mermaid PNG image embedding with correct sizing
Non-Mermaid code block rendering (shaded monospace)
Heading numPr auto-numbering removal
Image paragraph SINGLE line spacing (prevents clipping)
TOC field auto-refresh on open
Optional header/footer text replacement

Requirements: python-docx, Pillow.

`scripts/puppeteer-config.json`

Disables Chromium sandboxing for mmdc in Linux/container environments:

{ "args": ["--no-sandbox"] }

Companion Skills (Optional Enhancements)

This skill is fully self-contained — it works without any companion skills installed. However, if the following skills are available in your environment, they significantly improve specific stages:

Skill	Stage	Benefit
`claude-mermaid:mermaid-diagrams`	Stage 5 (Authoring)	Provides Mermaid syntax best practices, diagram type selection guidance, and live preview/save tools (`mermaid_preview` / `mermaid_save`). Produces higher-quality diagrams than writing Mermaid from scratch.
`superpowers:brainstorming`	Stage 1 (Template Analysis)	Structured brainstorming workflow that explores user intent, requirements, and design alternatives before committing to a TOC mapping. Reduces rework.
`superpowers:writing-plans`	Stage 3 (Plan)	Structured planning workflow for multi-step implementation tasks. Helps break complex documents into well-scoped per-chapter tasks.

How to use them: If a companion skill is available, invoke it via the Skill tool at the relevant stage. If it is not available, follow the inline guidance in this skill — the core instructions for each stage already cover the essential techniques.

Example: During Stage 5, if claude-mermaid:mermaid-diagrams is installed, invoke it before writing Mermaid blocks. If not, follow the Mermaid guidelines in the Authoring section directly.

Pre-Flight Checklist

Use this checklist when starting any new document:

corporate-doc-builder

Install

Corporate Doc Builder

Overview

When to Trigger

The 6-Stage Pipeline

Stage 1: Template Analysis

Pre-flight Checklist

Extract Template TOC

Extract Template Images and Tables

TOC Mapping Rules

Exit Criteria

Stage 2: Spec

Exit Criteria

Stage 3: Plan (Optional)

Stage 4: Research

Summarization Principle

Source Traceability Rule

Extraction Snippets

Exit Criteria

Stage 5: Authoring

File Layout

Why Per-Chapter Files

Mermaid Diagrams

Merge

Exit Criteria

Stage 6: Injection

Step 1: Render Mermaid to PNG

Step 2: Inject into Template

Pre-Injection Template Style Audit

Exit Criteria

Template Style Pitfalls

Pitfall 1: Image Clipping from Fixed Line Spacing

Pitfall 2: Double Numbering from Heading numPr

Pitfall 3: Missing Code Blocks

Pitfall 4: Template Placeholder Text Leaking

Token Budget Protection

Common Pitfalls

Reference Implementation

scripts/render_mermaid.py

scripts/inject_docx.py

scripts/puppeteer-config.json

Companion Skills (Optional Enhancements)

Pre-Flight Checklist

Related skills

`scripts/render_mermaid.py`

`scripts/inject_docx.py`

`scripts/puppeteer-config.json`