<!-- Derived from anthropics/financial-services-plugins under Apache-2.0. Extended by AIGroup for Chinese-market banker analysis with text-safety and data-provenance guardrails. -->
CN Client Investigation Skill
中国大陆客户调查分析 — 文字与数据双重保证
Use this skill when the target company is a China-market entity and the deliverable (MD / PPTX / Excel) must be free of Chinese character-level typos and based on verifiable Chinese-market data sources.
Why this skill exists
实测 2026-04-18 寒武纪 (Cambricon 688256.SH) 分析中,MiniMax-M2.7 在 pptxgenjs \uXXXX escape 序列里把"寒武纪"打成了"宽厭谛79",把"净利/财务/亏损"打成"洁利贜务贜损"。根因是中文罕见词的 token 级 BPE 切块 + 字符级错位。这个 skill 提供两道防线:
- Pre-computed Chinese lexicon:关键公司名 / 行业术语 / 财务词 先写死在本 skill 的 reference 文件里,agent 生成 slide JS 时直接
require('./references/cn-lexicon.js') 读取,不经过模型逐字符生成。
- UTF-8 literal over Unicode escape:JS 源码里
slide.addText("寒武纪科技") 而不是 slide.addText("\u5BD2\u6B66\u7EAA\u79D1\u6280")。literal 中文走 UTF-8 字节不经过 \uXXXX 解码路径,绕开 MiniMax 的主要 typo 通道。
Mandatory guardrails
Rule 1 — No \uXXXX escape for Chinese text (text accuracy)
In every slide-NN.js / markdown / Python code you emit, write Chinese characters as UTF-8 literals. NEVER encode Chinese as \uXXXX escape sequences.
// ✅ CORRECT — UTF-8 literal
slide.addText("寒武纪科技深度分析", { fontSize: 44, fontFace: "Microsoft YaHei" });
// ❌ WRONG — Unicode escape, MiniMax token-level typo risk
slide.addText("\u5BD2\u6B66\u7EAA\u79D1\u6280...", { ... });
All source files must be saved as UTF-8 no BOM. The write / edit tools accept raw Chinese fine; do not pre-encode to \uXXXX thinking it's "safer" — it is the opposite.
Rule 2 — Lexicon lookup before typing key terms (text accuracy)
Before emitting any of these classes of Chinese terms in code, look them up in references/cn-lexicon.js:
- Target company full name (e.g. 寒武纪科技、海光信息、摩尔线程智能科技)
- Section headers (公司概览、财务分析、竞争格局、估值分析、投资结论、投资亮点、风险分析)
- Financial line items (营业收入、归母净利润、扣非净利润、毛利率、净利率、研发费用、经营现金流)
- Investment recommendations (增持 / 中性 / 减持 / 买入 / 卖出 / 积极关注)
- Market / product terms (智算中心、AI加速、国产替代、CUDA兼容、实体清单、港交所递表)
If the term you need is not in the lexicon, add it to the lexicon first (one commit), then reference it. Don't type Chinese from memory in a \uXXXX form. See references/cn-lexicon.js.
Rule 3 — Cover page English primary, Chinese secondary (text safety shield)
Cover slides MUST use the English company name as 44pt hero title and Chinese name as ≤28pt subtitle. English ASCII has zero escape-typo risk; the Chinese subtitle is shorter and easier to sanity-check.
slide.addText("Cambricon Technologies", { fontSize: 44, fontFace: "Georgia", bold: true }); // 大字英文
slide.addText("寒武纪科技深度分析", { fontSize: 26, fontFace: "Microsoft YaHei" }); // 小字中文
Rule 4 — Data source hierarchy (data accuracy)
中国公司数据 MUST come from this tier, in order. Only drop to lower tier if higher tier returns empty / 402 / 403:
| Tier | Source | MCP tool / method |
|---|
| T1 — 交易所/监管披露 | 巨潮资讯 (cninfo.com.cn) 招股书 / 年报 / 季报 PDF 原文 | web_fetch on cninfo URL, parse PDF text |
| T1 — 盘面数据 | Tushare Pro | aigroup-market-mcp__company_performance, aigroup-market-mcp__stock_data, aigroup-market-mcp__basic_info |
| T2 — 公司公告 | 上交所 / 深交所 / 港交所官网 | web_fetch on sse/szse/hkex |
| T2 — 工商信息 | 天眼查 / 企查查(如 MCP 已装)/ 国家企业信用公示系统 | aigroup-tianyancha-mcp 或 web_search |
| T3 — 第三方数据 | Wind / 同花顺 / 东方财富 / FMP / Finnhub(港股/中概股) | aigroup-fmp-mcp, aigroup-finnhub-mcp |
| T4 — 公开报道 | 财新 / 21世纪 / 中新社 / 澎湃 / 财联社 | brave-web-search, web_fetch |
Rule 5 — Cross-check every hard number + provenance gate (data accuracy, MANDATORY)
Every financial number in the deliverable (营业收入 / 净利润 / 毛利率 / 市值 / 股价 / 融资金额) must be verified by at least 2 independent sources from the tier table above, OR clearly flagged as "single-source estimate" with the source cited in a page footer caption. If 2 sources diverge by > 5%, report both and pick the more recent; add a footnote.
Mandatory Phase 5 QA gate — every deliverable must pass provenance_verify.py before being considered shippable. The script scans the analysis markdown for hard numbers (digit + 亿/万/%/RMB/USD/元/CNY/HKD/M/B) and confirms every one of them has a matching row in the companion data-provenance.md tracking table. Missing provenance → exit 1 → block delivery.
python3 skills/cn-client-investigation/scripts/provenance_verify.py \
deliverable/analysis.md \
deliverable/data-provenance.md
Before shipping, also run in --strict mode. Strict adds two additional checks on top of the baseline substring match:
- Estimate-as-T1 smuggling — if a hard number in
analysis.md is adjacent to an estimate marker (~, 约, 大约, 估算, approximately, est., 粗估, 推算), the matching data-provenance.md row's source column MUST include at least one derivation keyword: [ESTIMATED] / [DERIVED] / 估算 / 推算 / derived / computed / analyst estimate. Otherwise the gate FAILs with the offending line number. This catches the common pattern where the agent writes ~22.9% but marks the provenance row with a T1 source like "Tushare Pro income_all" — which looked fine under the baseline substring check but is semantically misleading (the T1 source did not return 22.9%; the agent derived it).
- Precision drift (WARN, non-blocking) — when the same rounded integer + unit appears with multiple precisions in the analysis (e.g.
1.34 元/股 vs 1.340 元/股 vs 1.3 元/股), the gate emits a WARN so the agent can decide whether the differing precisions are intentional (e.g. pre vs post restatement with footnote) or a typo.
python3 skills/cn-client-investigation/scripts/provenance_verify.py --strict \
deliverable/analysis.md \
deliverable/data-provenance.md
Non-strict mode behavior is unchanged — --strict is additive and opt-in.
Every banker deliverable MUST include a data-provenance.md file at the deliverable root. Use the template under references/data-sources.md as the starting shape. Fill in one row per hard number with: 指标 / 数值 / 单位 / 期间 / Tier / 源 / URL 或工具 / 取数时间 / 交叉验证状态.
Rule 6 — No fabrication on missing data (data accuracy)
If a needed data point cannot be fetched (MCP returns error, web blocked, document inaccessible):
- DO NOT invent a plausible-looking number
- DO label the cell / chart / page-section as "数据不可得" / "N/A (source unavailable)" with a footnote explaining the attempted source
- DO proceed with the rest of the analysis — missing data doesn't block the deck
Historical micro_probit / panel_var_model style illustrative data is NOT appropriate for China banker deliverables — those tools belong to the lab bundle and produce demonstration output only.
Rule 7 — Self-verify deck text before delivery (typo detection, MANDATORY GATE)
Typo detection is NOT an optional step — it is a compile-time gate. A pptx that has not passed cn_typo_scan.py is NOT a shippable deliverable.
The canonical way to enforce this is to base slides/compile.js on the provided template:
references/compile_with_typo_gate.template.js.txt
Copy it to the deliverable's slides/compile.js (rename the .txt suffix off), adjust SLIDE_COUNT / OUTPUT_PATH / THEME at the top, then cd slides && node compile.js.
Why the .txt suffix in the plugin bundle: the template contains a child_process.spawnSync call to invoke the Python scanner, which OpenClaw's install-time safety scanner flags as a dangerous runtime pattern. Keeping the template as .js.txt under references/ tells the scanner this is documentation, not executable plugin code. At use time, you always copy it into your own deliverable's slides/ directory and strip the .txt — at that point it is your own script, outside the plugin trust boundary. The template:
- Standard pptxgenjs compile loop (require slide-01.js … slide-NN.js, call
createSlide(pres, theme), writeFile)
- Spawn
python3 with the skill's cn_typo_scan.py against the newly-written pptx's extracted text
- If scan exit is non-zero, node
process.exit(1) — the pptx is NOT considered delivered until the offending slide-NN.js files are fixed and the compile is re-run
If you cannot use the template verbatim (e.g. custom compile pipeline), you MUST still run the equivalent gate after every writeFile:
python3 -c "from pptx import Presentation; p = Presentation('deck.pptx'); [print(para.text) for s in p.slides for sh in s.shapes if sh.has_text_frame for para in sh.text_frame.paragraphs if para.text.strip()]" > /tmp/deck.txt
python3 skills/cn-client-investigation/scripts/cn_typo_scan.py /tmp/deck.txt # exit 0 = ship, exit 1 = abort
cn_typo_scan.py greps for these red-flag patterns (all observed in 2026-04-18 runs or confirmed on the broader \uXXXX token-drift pattern space):
- Rare character dyads that shouldn't appear in banker prose: 宽厭 / 谛数字 / 洁利 / 贜 / 校虚 / 催化济 / 棒品 / 转映 / 艺瑞 / 调诚
- Chinese chars immediately followed by digits (classic escape truncation symptom):
[一-龥][0-9]
- CJK Extension A / B / C / D characters (U+3400-U+4DBF, U+20000+) — almost always corruption in banker prose
On scan hit:
- Read the stderr report — each line gives L<n>, reason, and context snippet
- Identify the source
slide-NN.js file containing the offending text
- Replace the broken Unicode string with the UTF-8 literal fix (preferably via
LEXICON.red_zone.<key> lookup from references/cn-lexicon.js — Rule 2)
- Re-run
node slides/compile.js — the gate will rescan
Do NOT ship a pptx that has bypassed the gate. Do NOT --no-typo-scan your way out of failures.
Workflow
Phase 1 — Scope + lexicon load
- Confirm target: A-share / STAR / ChiNext / 北交所 / 港股 / 中概股 / 非上市独角兽 —— 决定数据源 tier
- Load / update
references/cn-lexicon.js:
- Target company name (中/英/ticker)
- Top-5 peers (中/英/ticker)
- Industry specific terms(AI 芯片 / 新能源车 / 创新药 / SaaS)
- Decide regulator context (证监会 / 香港证监会 / SEC for US-listed ADR)
Phase 2 — Data collection (按 tier 依次 try,记录 source)
For each required data element (营收/利润/股价/估值/股权结构/管理层/业务线/竞争/风险):
- Call T1 MCP (Tushare / 巨潮 fetch). Record raw output.
- If T1 failed or incomplete, call T2 (交易所官网 / 天眼查). Record.
- Cross-check: pick any hard number from T1 vs T2 vs T3 — require ≥ 2 agreeing sources or flag.
- Record source list in
references/data-provenance.md (update per company) with URL + retrieval timestamp.
Phase 3 — Analysis synthesis (投行传统维度)
Follow the banker-classical analysis frame (customer-analysis-pack skill), enhanced with:
- CN-specific 股权结构 section: 实控人 / 国资 / 员工持股 / 战略投资人 / 解禁时间表
- CN-specific 政策驱动 section: "十四五" / 新基建 / 专精特新 / 国产化替代进度
- CN-specific 监管风险 section: 证监会处罚历史 / 关联交易披露 / ESG 新规
Phase 3.5 — Raw-data snapshot (MANDATORY from v0.9.0)
Why: 2026-04-20 多公司 real-test 发现 MiniMax 在 provenance 的 Source 列里写"Wind (2026-04-17)"、"同花顺 F10"这种并未安装的工具名称作为来源 —— 纯捏造。要堵这个洞,agent 在写 analysis.md 之前必须把真实的 MCP 工具调用结果存成 JSON 快照,作为审计尾迹。
三个 CN MCP(插件 .mcp.json 已声明依赖):
| MCP | 覆盖 | 关键工具 |
|---|
aigroup-market-mcp | 上市公司行情 + 财务 (Tushare) | basic_info / company_performance / stock_data / index_data / finance_news |
PrimeMatrixData | 上市 + 非上市企业工商 + 司法 + 风险 (启信宝) | basic_info / judicial_info / risk_info / shareholder_info / finance_info |
Tianyancha | 上市 + 非上市企业基础 + 风险全景 (天眼查) | companyBaseInfo / risk |
要求:在 <deliverable-dir>/raw-data/ 目录下保存每次 MCP 调用的原始 JSON,文件名格式 {identifier}-{mcp-short}-{tool}.json。
上市公司(有 ts_code,如 002594.SZ / 300750.SZ / 0700.HK)必须包含:
{ts_code}-aigroup-market-mcp-basic_info.json
{ts_code}-aigroup-market-mcp-company_performance.json
{ts_code}-aigroup-market-mcp-stock_data.json
- ≥1 企业风险 overlay:
{uscc}-primematrix-basic_info.json(primary)或 {uscc}-tianyancha-companyBaseInfo.json(备用,见下)
非上市公司(只有 统一社会信用代码 / uscc)必须包含:
- ≥1 企业风险 overlay:
{uscc}-primematrix-basic_info.json(primary)或 {uscc}-tianyancha-companyBaseInfo.json(备用)
(aigroup-market-mcp 不适用,可省略)
强制前置步骤 — 公司名称核验(v0.9.2+):非上市公司调 PrimeMatrixData__basic_info 之前,必须先调 PrimeMatrixData__company_name 模糊查出精确注册名。常见的公众名 ≠ 法定名陷阱:
- "字节跳动" 实际内地主体已于 2023 年改名为 抖音有限公司
- "京东数科" 后改名 京东科技控股股份有限公司
- "滴滴" 的大陆注册主体是 北京小桔科技有限公司
用公众名硬调 basic_info,PrimeMatrix 返回 {} 空对象,下游数据全部空白。raw_data_check.py 现在会检测"PM basic_info 无统一社会信用代码"并 FAIL。正确姿态:
step 1: PrimeMatrixData__company_name(blur_name="字节跳动") → 列出匹配实体
step 2: 人工 / agent 选定法定名 → "抖音有限公司"
step 3: PrimeMatrixData__basic_info(company_name="抖音有限公司") → 完整工商信息
risk_info 空返回警觉:PM risk_info 若只返回 {"公司名称": "..."} 而无 司法/经营异常/关联风险 等字段,不等于"企业干净"——可能是 PM API 对该实体无数据返回。banker 交付前需手工再核一次司法公告/行政处罚/失信被执行人库,不能靠 gate 反向证明。
Tianyancha 当前状态(2026-04+):智谱 MCP broker 的 Tianyancha 账户暂停(余额耗尽)。gate 接受已有 snapshot 但不强制 — PrimeMatrixData 目前是唯一实际可达的企业风险 overlay。需要启用 Tianyancha 时充值 + 按 lead-discovery QUICKSTART 注册即可。
data-provenance.md 要求:每个 raw-data/*.json 文件的文件名 stem 必须在 data-provenance.md 的 Source 列至少出现一次 —— 这建立了"MD 里的数字 ↔ 溯源表 ↔ 原始 MCP 调用"的闭环。
Worked example (BYD 002594.SZ):
deliverables/byd-20260420/
├── raw-data/
│ ├── 002594.SZ-aigroup-market-mcp-basic_info.json ← basic_info 返回:公司简介 + 行业 + 上市日期
│ ├── 002594.SZ-aigroup-market-mcp-company_performance.json ← 营收 / 净利润 / 毛利率 / ROE 时间序列
│ ├── 002594.SZ-aigroup-market-mcp-stock_data.json ← 近 1 年日线 OHLC + 复权
│ └── 91440300192317458F-tianyancha-companyBaseInfo.json ← 工商基本信息 + 统一社会信用代码
├── data-provenance.md (每一行 Source 列写 `aigroup-market-mcp__company_performance` 或对应 raw-data 文件名 stem)
├── analysis.md
├── slides/ ...
└── 比亚迪_deep_analysis.pptx
验证:raw_data_check.py(在 validate-delivery.py aggregator 的第 3c 道 gate 自动跑)会确认:
raw-data/ 目录存在且至少有 3 个 JSON 文件
- 上市 vs 非上市的工具覆盖满足上述要求
- 每个 raw JSON 文件 stem 都在 provenance 里被引用
Back-compat:0.8.x 版本的旧交付物没有 raw-data/ → 默认模式下 gate 3c 给出 WARN 但不 FAIL;--strict-mcp 模式下直接 FAIL。新交付物必须上 raw-data/。
Phase 4 — Deliverable generation (banker-memo preferred, v0.9.6+)
PREFERRED ROUTE (0.9.6+): Prompt-driven banker-memo skill + build_outline_deck.py.
The banker-memo skill dispatches the MiniMax agent through an investment-banker-analyst framework (8-section research memo + content-driven 10-15 slide outline, no fixed page count). Usage:
# 1. After raw-data/ is populated (Phase 3.5), dispatch banker-memo skill
python3 scripts/banker-memo/scripts/build_banker_prompt.py \
<ts_code> <name_cn> <industry> <raw_dir> <out_dir> > /tmp/prompt.md
openclaw agent --agent main --thinking high --json --timeout 600 \
--message "$(cat /tmp/prompt.md)"
# Agent writes analysis.md + slides-outline.md + data-provenance.md
# 2. Compile outline-driven PPT
python3 scripts/cn-client-investigation/scripts/build_outline_deck.py \
<dir> <ts_code> <name_cn> <name_en>
# 3. Close provenance gaps + validate
python3 scripts/cn-client-investigation/scripts/sync_provenance.py <dir>
python3 scripts/cn-client-investigation/scripts/validate-delivery.py --strict-mcp <dir>
Why preferred: 0.9.5 Python-templated build_deck.py produced an 8-slide data dashboard — no industry context, no peer benchmarking, no SOTP / 4C's. 0.9.6 prompt-driven path: agent writes banker narrative (14-20 KB analysis) with peer comparison ([EST] tagged), Data Flag self-reporting (e.g. income vs company_performance 0.59pp discrepancy), SOTP valuation scenarios, 4C's credit framework with specific 授信额度 / 期限 / 利率 / 增信 recommendations.
LEGACY ROUTE (still supported for quick fact-sheets): build_deck.py emits a fixed 8-slide stat-card dashboard. Use only when depth isn't required.
Both routes share: pptxgenjs slide compile + cn_typo_scan post-write gate. NEVER use python-pptx for generation (white background / no theming; only permitted for text extraction inside validate-delivery.py).
For CN targets, these routes override ppt-deliverable's "MiniMax first" routing — only pptxgenjs compile integrates the compile-time typo gate.
Steps:
- Write
slides/slide-01.js … slide-NN.js — each exports createSlide(pres, theme).
- Copy
references/compile_with_typo_gate.template.js.txt → slides/compile.js (strip .txt), set SLIDE_COUNT / OUTPUT_PATH / THEME.
cd slides && node compile.js — the template's post-write gate runs cn_typo_scan.py; if it fails, fix the offending slide-NN.js and recompile.
slides/slide-01.js cover uses Rule 3 (English hero 44pt + Chinese subtitle ≤28pt).
- Every
addText with Chinese content uses the lexicon (require('./references/cn-lexicon.js')); do NOT inline-type long Chinese strings.
Phase 5 — QA (用 validate-delivery.py 单入口)
推荐一条命令跑完六道 gate:
python3 ~/.openclaw/extensions/aigroup-financial-services-openclaw/skills/cn-client-investigation/scripts/validate-delivery.py \
--strict --strict-mcp --style \
/path/to/deliverable_dir
# exit 0 → 全部 PASS;--strict-mcp 下还要求 source 列有 MCP-tool / 官方披露 anchor 且 raw-data/ 齐全
# exit 1 → 至少一道 gate 失败,stderr 指出是哪道 + 具体行号 / 原因
Aggregator 自动按文件名 find 并调度:
*intelligence*.md → verify_intelligence.py(跨插件引用 lead-discovery 的 cn-lead-safety)
*.pptx → 自动 extract text + cn_typo_scan.py
*.pptx + data-provenance.md → slide_data_audit.py(v0.6.0 起:PPT 上每个硬数字必须有 provenance 行支撑;捕获 Phase 4 手改 slide-NN.js 后 provenance 未同步更新的漂移)
analysis.md + data-provenance.md → provenance_verify.py(--strict 开启 estimate-as-T1 smuggling 检测 + 精度漂移 WARN)
data-provenance.md → source_authenticity_check.py (NEW v0.9.0:扫 Source 列,拦 Wind / 同花顺 等未安装工具的伪造标注;--strict-mcp 下非 MCP / 非官方披露全部 FAIL)
<dir>/raw-data/ → raw_data_check.py (NEW v0.9.0:校验 Phase 3.5 的 MCP 调用快照 + 上市 vs 非上市覆盖 + provenance 引用闭环;--strict-mcp 下缺 raw-data/ 直接 FAIL)
- 可选
--style → style_scan.py --warn-only 对 analysis.md 扫货币/期间/日期/YoY 术语一致性(非阻塞)
Debug 单 gate 时仍可分别跑(见各自脚本)。额外手动 QA:
- Sensitive items(未披露财务细节 / 估值倍数)must 标 "估算" / "illustrative" with caption if not from T1-T2 source.
- Final deck 加 "已知缺口 / 数据置信" 汇总 section 到 appendix.
强烈建议:validate-delivery PASS 之后,交付客户之前,再跑一次 data-quality-audit skill(在 paired plugin aigroup-lead-discovery-openclaw 中)做独立交叉源验证。Layer 1 的三道 upstream gate 只能校验交付形式合规性(硬数字有溯源 / 不含 typo / estimate 标注规范),但数据语义正确性(pre vs post restatement / ROE 口径 / 股价基准)需要独立二源拉取数据来打分 —— 这是 data-quality-audit 的职责。validate-delivery 在 overall PASS 时会自动在 stdout 尾部打印 "Next step: run data-quality-audit skill …" 提示。
海天味业 2026-04-19 的 audit 就是一例:三道 upstream gate 全绿,但 data-quality-audit 抓到 EPS 2022 pre/post-restatement FAIL(1.34 vs 1.11 差 20.7%)。现在 --strict 模式 + restatement_aware 规则已把这一类问题部分上移到 Layer 1,但完整的 pre/post 版本识别仍需 Layer 3 独立拉数对比。
Output standard
- MD 底稿 + PPTX 交付
references/data-provenance.md listing every number source
cn_typo_scan.py output attached as QA evidence
- Absolute paths in final report
Integration with existing skills
This skill supersedes the generic customer-investigation + customer-analysis-pack flow when target is a China entity. The banker analysis frame (datapack-builder 8-tab structure, dcf-model WACC methodology, pitch-deck slide conventions) still apply —— this skill adds the text-safety and CN-specific data-source layer on top of them.
For non-CN targets (US / EU / JP / KR / IN / SE-Asia / LATAM), use the generic skills without this overlay.
Not in scope
- Non-Chinese company analysis (use generic banker skills)
- Experimental econometric validation with
aigroup-econ-mcp (lab bundle only; not appropriate for banker client deliverables)
- Embedded markdown→pptx via
aigroup-mdtopptx-mcp (lab bundle only)