Install
openclaw skills install paper-defense-qa-code-training

Prepare evidence-grounded computer-science paper defense Q&A, code/training audits, reviewer-style attack surfaces, mock-defense scripts, backup-slide plans, ...

Use this skill when the user wants to prepare for a paper defense, thesis defense, lab-meeting Q&A, conference rebuttal discussion, advisor grilling, reviewer-style mock exam, or PPT defense for a computer-science / machine-learning paper.
This skill is designed as a downstream companion to paper_deep_reading_teaching_explainer_v10. The v10 skill produces the authoritative deep-reading and teaching report. This skill turns that report, the original paper, and any code / training artifacts into a defense-ready attack map and answer bank.
The central output is not a generic FAQ. It is a paper-specific and code-specific defense pack that asks:
What will a skeptical committee, reviewer, peer, or engineer ask about this exact paper?
What evidence from the paper / code / logs supports the answer?
What should the speaker say, and what should they avoid overclaiming?
For each target paper, produce a full defense pack (the deliverable files are listed in the output layout below).
The output should help the user answer hard questions without bluffing, over-defending, or hiding limitations.
Read the v10 outputs first when available:
reports/per_paper/<paper-slug>/<paper-slug>_detailed_cn.md
generated/teaching/<paper-slug>/qa_bank_cn.json
generated/teaching/<paper-slug>/slide_blueprint_cn.json
metadata/source_record.json
metadata/project_directory_index.json
metadata/routing_status.json
metadata/delivery_bundle_manifest.json

Do not replace the authoritative v10 detailed report. Treat it as the evidence base and build a derivative defense layer.
If the v10 detailed report is unavailable, this skill may still work from the paper PDF and code artifacts, but the output must mark the missing deep-read report as a blocker or reduced-confidence condition.
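A minimal sketch of how this availability check could be automated, assuming the v10 paths listed above and a single paper slug; the paths and the reduced_confidence flag are illustrative, not a fixed contract:

```python
from pathlib import Path

# v10 input paths for one paper slug, as listed above; adjust to the actual project layout.
V10_INPUTS = [
    "reports/per_paper/{slug}/{slug}_detailed_cn.md",
    "generated/teaching/{slug}/qa_bank_cn.json",
    "generated/teaching/{slug}/slide_blueprint_cn.json",
    "metadata/source_record.json",
    "metadata/project_directory_index.json",
    "metadata/routing_status.json",
    "metadata/delivery_bundle_manifest.json",
]

def check_v10_inputs(project_root: str, slug: str) -> dict:
    """Report which v10 artifacts exist and whether the deep-read report is missing."""
    root = Path(project_root)
    status = {p.format(slug=slug): (root / p.format(slug=slug)).exists() for p in V10_INPUTS}
    detailed_report = f"reports/per_paper/{slug}/{slug}_detailed_cn.md"
    return {
        "inputs": status,
        "blockers": [p for p, ok in status.items() if not ok],
        # A missing deep-read report downgrades the whole defense pack, per the rule above.
        "reduced_confidence": not status[detailed_report],
    }
```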
Use or infer:
{
"defense_context": "thesis_defense | lab_meeting | conference_rebuttal | paper_reading_group | advisor_grilling | job_talk | proposal_defense",
"target_audience": ["advisor", "committee", "reviewer", "peer", "beginner", "practitioner"],
"paper_subfield": "e.g. ML systems / CV / NLP / security / theory / databases / HCI",
"known_weaknesses": [],
"known_sensitive_points": [],
"available_time_minutes": 30,
"expected_qna_minutes": 30,
"code_available": true,
"training_artifacts_available": true,
"risk_tolerance": "conservative"
}
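A minimal sketch of loading and sanity-checking this spec, assuming it is stored as metadata/defense_focus_spec.json; the allowed context values are copied from the field description above, and the defaults are illustrative:

```python
import json
from pathlib import Path

ALLOWED_CONTEXTS = {
    "thesis_defense", "lab_meeting", "conference_rebuttal",
    "paper_reading_group", "advisor_grilling", "job_talk", "proposal_defense",
}

def load_defense_focus_spec(path: str = "metadata/defense_focus_spec.json") -> dict:
    """Load the defense focus spec, filling conservative defaults for missing keys."""
    spec = json.loads(Path(path).read_text(encoding="utf-8")) if Path(path).exists() else {}
    spec.setdefault("defense_context", "thesis_defense")
    spec.setdefault("target_audience", ["advisor", "committee"])
    spec.setdefault("known_weaknesses", [])
    spec.setdefault("known_sensitive_points", [])
    spec.setdefault("available_time_minutes", 30)
    spec.setdefault("expected_qna_minutes", 30)
    spec.setdefault("code_available", False)
    spec.setdefault("training_artifacts_available", False)
    spec.setdefault("risk_tolerance", "conservative")
    if spec["defense_context"] not in ALLOWED_CONTEXTS:
        raise ValueError(f"unknown defense_context: {spec['defense_context']}")
    return spec
```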
Every generated answer must be tagged as one of:
paper_grounded: directly supported by the paper, appendix, supplement, or official review materials;
code_grounded: supported by repository files, configs, logs, scripts, or checkpoints;
experiment_log_grounded: supported by runs, seeds, hardware records, sweep logs, result tables, or failure cases;
review_grounded: supported by public reviewer / rebuttal / meta-review text;
inferred: reasoned from available evidence but not explicitly stated;
missing_evidence: plausible but not supported by current materials;
external_context: supported by official venue guidelines, benchmark docs, dataset docs, or verified external sources.

Do not merge these labels. If an answer depends on inference or missing evidence, say so explicitly.
A strong defense answer has this shape:
Claim: what we can defend.
Evidence: where the paper/code/log supports it.
Boundary: what the evidence does not show.
Why it matters: why the design or result is still meaningful.
Follow-up: what test, ablation, or implementation check would resolve the remaining doubt.
When the evidence is incomplete, use the balanced-defense template:
The current materials support X through [paper section / figure / table / code / log].
They do not fully establish Y because [missing baseline / missing seed variance / unavailable training log / untested setting].
So the safe answer is: X is supported under [scope], while Y remains a limitation.
A fair follow-up would be [specific experiment or code check].
Never answer a hard question by pretending the paper proves more than it does.
This skill may prepare a series of illustrated Q&A cards, storyboard frames, and image prompts, but text answering and image generation must be separated.
Always do the text work first:
For ChatGPT web usage, tell the user to open or invoke Create image mode and ask to generate the numbered image prompts from visual_image_prompt_pack_cn.md. Prefer ChatGPT Images 2.0 / gpt-image-2 when available.
For Codex / CLI / agent usage, do not use low-quality local placeholder art for the final visual cards. Use ChatGPT Images 2.0 / gpt-image-2 through an approved image-generation API, or another user-approved high-quality text-to-image API. Keep the Markdown/JSON answer pack and the API image-generation call as separate steps.
When exact equations, code snippets, or tiny labels must be perfectly readable, generate the image as a clean background / metaphor / layout and overlay exact text later in PPT, SVG, HTML, or another deterministic renderer. Use generated text inside the image only when small wording errors would not damage the defense.
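A minimal sketch of that overlay step, assuming Pillow is installed, a generated 16:9 background already exists on disk, and a local font file is available; the paths, font, and coordinates are placeholders:

```python
from PIL import Image, ImageDraw, ImageFont  # requires Pillow

def overlay_exact_text(background_path: str, out_path: str, lines: list[str]) -> None:
    """Render exact equations/labels deterministically on top of a generated background."""
    img = Image.open(background_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("DejaVuSans.ttf", 36)  # placeholder font path and size
    x, y = 80, 80  # placeholder layout origin
    for line in lines:
        draw.text((x, y), line, font=font, fill="black")
        y += 56
    img.save(out_path)

# Example: keep the wording byte-exact instead of trusting in-image text generation.
# overlay_exact_text("card_07_bg.png", "card_07_final.png", ["L = L_task + lambda * L_reg"])
```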
At the end of every text-only delivery or textual response produced by this skill, append exactly this follow-up prompt as the final image-generation reminder. Do this after the written answer, while still keeping image generation as a separate follow-up step:
请用chatgpt images 2.0 生成一系列 16:9 辅助答辩的插图,这些插图可以图文并茂的覆盖上面文字中的问题以及用生动的图表来解释回答,有助于更好准备答辩回答。
This sentence is a prompt for the next image-generation step, not an instruction to generate images inside the current text-only answer.
Run the workflow in eight passes.
Read the project state and identify the defense scope, the available evidence, and the evidence gaps.
Output a short Defense Scope block and a Blockers / Evidence Gaps block before generating detailed answers.
Extract every major defensible claim:
For each claim, record the exact evidence:
Claim -> Evidence -> Strength -> Caveat -> Likely question -> Answer posture
Generate questions along the attack axes used later for classification: novelty, soundness, reproducibility, experiment, code, training, theory, ethics, and presentation.
If code or training artifacts are available, audit them. If they are not available, generate questions that a committee would ask and mark the answers as missing_evidence or paper_grounded.
Audit axes:
| Axis | What to inspect | Typical defense risk |
|---|---|---|
| Repository entry points | train.py, eval.py, configs, README commands | unclear reproducibility path |
| Dependencies and environment | versions, CUDA, packages, Docker, hardware | results depend on hidden environment |
| Data pipeline | split generation, preprocessing, augmentation, leakage prevention | train/test leakage or unfair comparison |
| Model implementation | architecture, equation-to-code mapping, initialization, frozen modules | paper method not faithfully implemented |
| Loss and optimization | objective signs, weights, schedules, optimizer, gradient clipping, mixed precision | fragile or under-explained training tricks |
| Hyperparameter search | search space, validation protocol, selected configs | cherry-picking or unfair tuning |
| Randomness and seeds | number of runs, seed control, deterministic settings | unreported variance |
| Checkpoint selection | early stopping, best-on-validation vs test, model averaging | implicit test tuning |
| Evaluation code | metric implementation, post-processing, statistical aggregation | metric mismatch or inflated performance |
| Baseline reproduction | official implementations, tuning budget, same data preprocessing | weak or unfair baselines |
| Compute and cost | GPU type, hours, memory, energy estimate, inference latency | impractical or non-comparable cost |
| Released artifacts | pretrained weights, logs, commands, result tables | cannot verify main claims |
| Failure runs | negative results, instability, divergent runs | hidden brittleness |
| Licenses and ethics | dataset license, model release, PII, safety risks | legal or ethical blind spots |
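As one concrete instance of the data-pipeline axis above, a minimal leakage probe, assuming the train and test splits can be exported as plain-text records; the file names and exact-match hashing granularity are assumptions, not part of any specific paper's pipeline:

```python
import hashlib

def split_overlap(train_records: list[str], test_records: list[str]) -> int:
    """Count exact-duplicate records shared by train and test splits (a quick leakage smell test)."""
    def digest(record: str) -> str:
        return hashlib.sha256(record.strip().lower().encode("utf-8")).hexdigest()
    train_hashes = {digest(r) for r in train_records}
    return sum(1 for r in test_records if digest(r) in train_hashes)

# Example (hypothetical files):
# overlap = split_overlap(open("train.txt").readlines(), open("test.txt").readlines())
# Any overlap above zero is a high-severity defense question waiting to happen.
```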
For every code-related question, include:
Question:
Likely trigger:
Evidence to check:
Safe answer if evidence exists:
Safe answer if evidence is missing:
Backup artifact to prepare:
Classify every question by:
likelihood: high / medium / low;
severity: high / medium / low;
answer_readiness: ready / needs evidence / risky / cannot defend;
audience: beginner / peer / advisor / reviewer / committee / practitioner / author-defender / bug-hunter;
attack_axis: novelty / soundness / reproducibility / experiment / code / training / theory / ethics / presentation;
answer_mode: concise / technical / evidence-heavy / limitation-acknowledging / bridge-to-future-work.

Prioritize as follows:
| Priority | Condition | Required action |
|---|---|---|
| P0 | high likelihood + high severity + weak evidence | prepare honest limitation answer and backup slide |
| P1 | high likelihood + high severity + strong evidence | memorize answer and evidence reference |
| P2 | medium likelihood + high severity | prepare short answer and one backup detail |
| P3 | high likelihood + low severity | answer briefly, do not spend too much time |
| P4 | low likelihood + low severity | keep as appendix / backup only |
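A minimal sketch of the prioritization rule in the table above; using answer_readiness as the proxy for "weak evidence" is an assumption, and combinations the table does not cover fall back to a conservative P2:

```python
def priority(likelihood: str, severity: str, answer_readiness: str) -> str:
    """Map a question's triage fields to P0-P4 following the priority table above."""
    weak_evidence = answer_readiness in {"needs evidence", "risky", "cannot defend"}
    if likelihood == "high" and severity == "high":
        return "P0" if weak_evidence else "P1"
    if likelihood == "medium" and severity == "high":
        return "P2"
    if likelihood == "high" and severity == "low":
        return "P3"
    if likelihood == "low" and severity == "low":
        return "P4"
    return "P2"  # conservative default for combinations the table does not list
```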
For each major question, produce:
Q_ID:
Question:
Audience:
Attack axis:
Why they may ask:
Expected concern:
Short answer, 1-2 sentences:
Long answer:
Evidence references:
Confidence:
What not to overclaim:
Backup slide / artifact:
If challenged again:
Follow-up experiment or code check:
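A minimal illustration of one answer-bank entry as it might appear in defense_qa_bank_cn.json; the field names mirror the card fields above, but the JSON schema is not fixed by this skill and every value below is a hypothetical placeholder, not content from any real paper:

```python
import json

# Hypothetical entry illustrating the field list above.
qa_entry = {
    "q_id": "Q-EXP-03",
    "question": "Were all baselines tuned with the same budget?",
    "audience": "reviewer",
    "attack_axis": "experiment",
    "likelihood": "high",
    "severity": "high",
    "answer_readiness": "needs evidence",
    "short_answer": "Baselines reuse official configs; the tuning-budget table is in the appendix.",
    "evidence_references": [{"label": "paper_grounded", "location": "Appendix C, Table 7"}],
    "confidence": "paper_grounded",
    "do_not_overclaim": "Do not claim every baseline was re-tuned per dataset.",
    "backup_artifact": "backup slide: tuning-budget comparison table",
    "priority": "P0",
}

print(json.dumps(qa_entry, ensure_ascii=False, indent=2))
```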
The answer bank must include, at minimum, items covering every question category in the taxonomy below.
For small papers or short user requests, reduce counts but keep all categories represented.
Create a rehearsal sequence of mock-defense Q&A rounds.
For each round, include expected answer length, likely interruption, and a recovery phrase.
When the user wants more intuitive, figure-rich defense preparation, transform high-value Q&A items into a visual series. Prioritize P0/P1 questions, code/training questions, and concepts that are hard to explain verbally.
Create these visual artifacts:
visual_qa_storyboard_cn.md
visual_qa_storyboard_cn.json
visual_image_prompt_pack_cn.md
visual_generation_handoff_cn.md
visual_card_copy_cn.md
Use this structure for each visual card:
Card ID:
Linked Q_IDs:
Purpose:
Question shown to audience:
Spoken answer summary:
Evidence label:
Boundary / what not to overclaim:
Visual metaphor or diagram:
Image prompt:
Text overlay plan:
Follow-up prompt for revision:
Generation status: text_ready | image_pending | image_generated
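A minimal sketch of turning a high-priority answer-bank entry into a storyboard card skeleton with the fields above; the mapping is illustrative and assumes the hypothetical qa_entry structure shown earlier:

```python
def to_visual_card(qa_entry: dict, card_id: str) -> dict:
    """Build a text-ready visual card from a P0/P1 Q&A entry; image generation stays a later step."""
    return {
        "card_id": card_id,
        "linked_q_ids": [qa_entry["q_id"]],
        "purpose": f"Defend against: {qa_entry['question']}",
        "question_shown": qa_entry["question"],
        "spoken_answer_summary": qa_entry["short_answer"],
        "evidence_label": qa_entry["confidence"],
        "boundary": qa_entry["do_not_overclaim"],
        "visual_metaphor": "claim-evidence map",  # pick from the card-type table below
        "image_prompt": "",                        # written during the visual pass
        "text_overlay_plan": "exact wording overlaid outside the generated image",
        "generation_status": "image_pending",
    }
```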
Recommended visual card types:
| Card type | Use when | Visual idea |
|---|---|---|
| Claim-evidence map | The question asks “where is that proven?” | claim nodes connected to paper/code/log evidence |
| Attack-surface radar | Many risks must be prioritized | radar or heatmap of novelty, soundness, reproducibility, compute |
| Method pipeline | The method is hard to explain | left-to-right architecture / data-flow storyboard |
| Equation-to-code bridge | The committee may ask whether implementation matches math | equation block connected to files/functions/config keys |
| Training timeline | Training stability or compute is questioned | timeline from data split to final checkpoint |
| Baseline fairness board | Reviewers may attack experiments | comparison table metaphor with same data, same metric, same tuning budget |
| Limitation boundary card | The answer must avoid overclaiming | safe zone / out-of-scope boundary diagram |
| Recovery answer card | The speaker needs a memorized answer | question bubble + claim/evidence/boundary/follow-up pattern |
| Backup slide visual | P0/P1 question needs a reserve slide | clean academic appendix-style figure |
Keep the visual series coherent: same aspect ratio, same visual style, same paper title convention, same evidence-label icons, and consistent terminology.
Question pattern:
Isn't this just [prior method A] + [known trick B]?
Answer pattern:
The closest prior work is indeed [A], and [B] is inherited.
The new part is [specific constraint / unavailable mechanism / surrogate / training setup].
The evidence that this is not only a plug-in is [ablation / comparison / theoretical analysis / failure of direct baseline].
The limitation is [what remains incremental or under-tested].
Question pattern:
Why didn't you compare with [obvious baseline]?
Were baselines tuned fairly?
Answer pattern:
The paper compares against [included baselines] because they cover [families].
[Missing baseline] would be relevant because [reason].
If it is absent, we should state that this is a limitation rather than dismiss it.
The fair follow-up is to run [baseline] under the same preprocessing, tuning budget, and metric.
Question pattern:
Does the ablation prove the module's claimed role, or only show a performance drop?
Answer pattern:
The ablation supports [local contribution] by removing/changing [component].
It does not by itself prove [causal explanation] unless the paper also shows [mechanism evidence].
So the defensible claim is [narrow claim].
A stronger test would be [targeted ablation / diagnostic / stress condition].
Question pattern:
How sensitive is this result to seeds, hyperparameters, and compute?
Answer pattern:
The reproducibility evidence available is [number of runs / std / config / logs / hardware].
The weak point is [missing variance / missing sweep / undocumented resource].
The safe answer is that the reported result is supported under [specified setup], but robustness to [unreported factor] is not fully established.
Question pattern:
How do we know the code implements the method described in the equations?
Answer pattern:
Map equation/module [X] to [file:function/config].
The key variables are [paper symbols] corresponding to [code names].
The training/evaluation entry points are [commands].
If this mapping is absent from the repository, state that reproducibility is weakened and prepare a code-to-equation table.
Question pattern:
Where does the method fail?
Answer pattern:
The paper's tested scope is [datasets/settings].
Within that scope, the weakest evidence is [failure case / lower-performing setting / missing stress test].
The likely failure mode is [assumption violated].
This does not invalidate the contribution, but it narrows the claim to [safe scope].
Question pattern:
Could the method cause harm or be misused?
Answer pattern:
The relevant risk is [privacy / bias / misuse / security / environmental cost / human-subject risk].
The paper addresses it through [evidence] or does not address it sufficiently.
The safe defense is to state the actual mitigation and identify what remains unresolved.
When building a full defense pack, create:
generated/defense/<paper-slug>/defense_scope_cn.md
generated/defense/<paper-slug>/claim_evidence_map_cn.md
generated/defense/<paper-slug>/paper_attack_surface_cn.md
generated/defense/<paper-slug>/code_training_audit_cn.md
generated/defense/<paper-slug>/defense_qa_bank_cn.md
generated/defense/<paper-slug>/defense_qa_bank_cn.json
generated/defense/<paper-slug>/answer_playbook_cn.md
generated/defense/<paper-slug>/mock_defense_script_cn.md
generated/defense/<paper-slug>/backup_slide_plan_cn.md
generated/defense/<paper-slug>/evidence_gap_triage_cn.md
generated/defense/<paper-slug>/visual_qa_storyboard_cn.md
generated/defense/<paper-slug>/visual_qa_storyboard_cn.json
generated/defense/<paper-slug>/visual_image_prompt_pack_cn.md
generated/defense/<paper-slug>/visual_generation_handoff_cn.md
generated/defense/<paper-slug>/visual_card_copy_cn.md
If the user only asks for a short Q&A list, produce a compact Markdown answer but still follow evidence labels.
The main defense_qa_bank_cn.md must contain:
答辩范围与证据状态
论文一句话主张与最安全表述
核心贡献的可防守版本
高风险问题总览
论文层面问题与回答
方法 / 公式 / 理论问题与回答
实验 / 消融 / 基线问题与回答
代码与训练过程问题与回答
可复现性与工程实现问题与回答
局限性 / 失败模式 / 伦理风险问题与回答
未来工作与研究边界问题与回答
最不该说的话
备份页与证据材料清单
模拟答辩脚本
最后 10 分钟速记卡
图文答辩卡片与生图提示词

Use this taxonomy as a minimum list. Expand based on the target paper.
Goal: check whether the speaker can explain without jargon.
Answer style:
Goal: check mechanism and details.
Answer style:
Goal: check independence, judgment, and limitation awareness.
Answer style:
Goal: attack novelty, soundness, reproducibility, and evidence.
Answer style:
Goal: check cost, reliability, adoption, and operational risk.
Answer style:
For every generated answer, perform a red-team pass:
If yes, rewrite the answer conservatively.
When code is available, build a mapping table:
| Paper object | Paper location | Code path | Function / class | Config key | Evidence strength | Risk |
|---|---|---|---|---|---|---|
Include at least:
If code line numbers are known, include them. If not, include file and function names.
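A minimal sketch for seeding that mapping table by scanning the repository for symbols or function names the paper's equations suggest; the repo path and symbol list are placeholders supplied by the user, not derived from any specific codebase:

```python
import re
from pathlib import Path

def find_symbol_hits(repo_root: str, symbols: list[str]) -> dict[str, list[str]]:
    """Locate candidate code sites for each paper symbol to seed the equation-to-code table."""
    hits: dict[str, list[str]] = {s: [] for s in symbols}
    for path in Path(repo_root).rglob("*.py"):
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            for sym in symbols:
                if re.search(rf"\b{re.escape(sym)}\b", line):
                    hits[sym].append(f"{path}:{lineno}")
    return hits

# Example (hypothetical symbols): find_symbol_hits("repo/", ["contrastive_loss", "temperature", "tau"])
```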
When logs or configs are available, build a training-run audit:
| Run / config | Dataset | Seed | Hardware | Time | Key hyperparameters | Result | Matches paper? | Notes |
|---|---|---|---|---|---|---|---|---|
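A minimal sketch for seeding this audit table from run configs, assuming YAML configs and PyYAML; the field names (seed, optimizer, lr, batch_size) are common conventions rather than guaranteed keys, and the "Matches paper?" column stays a manual judgment:

```python
from pathlib import Path
import yaml  # PyYAML; install separately

def audit_configs(config_dir: str) -> list[dict]:
    """Collect reproducibility-critical fields from each run config for the audit table."""
    rows = []
    for cfg_path in sorted(Path(config_dir).glob("*.yaml")):
        cfg = yaml.safe_load(cfg_path.read_text(encoding="utf-8")) or {}
        rows.append({
            "run": cfg_path.stem,
            "dataset": cfg.get("dataset"),
            "seed": cfg.get("seed"),
            "optimizer": cfg.get("optimizer"),
            "lr": cfg.get("lr"),
            "batch_size": cfg.get("batch_size"),
            "matches_paper": None,  # filled in manually against the reported setup
        })
    return rows
```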
Look for:
For high-risk questions, propose backup slides:
| Question | Backup slide title | Content | Evidence | When to show |
|---|---|---|---|---|
Backup slides should cover:
Always identify the top 5-10 most dangerous questions.
For each:
Why dangerous:
Current evidence:
Best honest answer:
What not to say:
How to reduce risk before the defense:
Dangerous questions often come from:
Use Chinese by default unless the user asks otherwise. Keep technical terms in English when they are standard in the field, but explain them.
Use compact tables for prioritization and evidence mapping. Use natural spoken answers for Q&A items, because the user needs to speak them during the defense.
Prefer exact, defendable answers over impressive but vague answers.
For visual outputs, keep the image prompts practical: specify the card purpose, audience, visual metaphor, layout, style, aspect ratio, and exact text to overlay outside the image when needed. Avoid decorative images that do not help answer a real defense question.
A complete defense pack should look like:
paper_defense_bundle/
  metadata/
    defense_focus_spec.json
    defense_generation_status.json
  generated/
    defense/
      <paper-slug>/
        defense_scope_cn.md
        claim_evidence_map_cn.md
        paper_attack_surface_cn.md
        code_training_audit_cn.md
        defense_qa_bank_cn.md
        defense_qa_bank_cn.json
        answer_playbook_cn.md
        mock_defense_script_cn.md
        backup_slide_plan_cn.md
        evidence_gap_triage_cn.md
        visual_qa_storyboard_cn.md
        visual_qa_storyboard_cn.json
        visual_image_prompt_pack_cn.md
        visual_generation_handoff_cn.md
        visual_card_copy_cn.md
  reports/
    stage_delivery_handoff.md
Before handoff, verify:
defense_qa_bank_cn.json is valid JSON;
every visual card's generation status stays image_pending until a separate image-generation step is explicitly requested.

Use scripts/validate_defense_qa_bundle.py when operating locally.
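When the packaged script is not at hand, a minimal local check of the two conditions above might look like the sketch below; the storyboard field name generation_status follows the card structure defined earlier and is an assumption about the JSON layout, not the actual script's behavior:

```python
import json
from pathlib import Path

def check_defense_bundle(paper_dir: str) -> list[str]:
    """Return human-readable problems found in a defense pack before handoff."""
    problems = []
    root = Path(paper_dir)

    qa_path = root / "defense_qa_bank_cn.json"
    try:
        json.loads(qa_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        problems.append(f"defense_qa_bank_cn.json is not valid JSON: {exc}")

    storyboard_path = root / "visual_qa_storyboard_cn.json"
    if storyboard_path.exists():
        cards = json.loads(storyboard_path.read_text(encoding="utf-8"))
        for card in cards if isinstance(cards, list) else []:
            if card.get("generation_status") == "image_generated":
                problems.append(f"{card.get('card_id')} marked image_generated before the image step")
    return problems
```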
For ClawHub distribution, keep the skill folder name URL-safe and lowercase: paper-defense-qa-code-training. Keep SKILL.md frontmatter with name, description, and semver version. Include only text-based support files, keep the bundle under 50MB, and use MIT-0 / MIT No Attribution terms. Use scripts/package_clawhub_skill.py to validate and produce .zip and .skill archives.
At the end of every substantive reply using this skill, append:
Current Status
Recommended Next Skill
Possible User Inputs For Next Stage
Typical next skill:
slides_creation_or_ppt_refinement if the user wants PPT;
paper_deep_reading_teaching_explainer_v10 if the paper has not yet been deeply read;
report-innovation-graph-workbench if the user wants future-direction mining after defense prep;
a separate image-generation step if the user wants to turn the visual_image_prompt_pack_cn.md prompts into actual figures.