Protein QC Strict

v4.0.0

Strictest protein sequence analysis quality control workflow (3365→456 sequences). Includes literature validation, CD-HIT redundancy removal, complexity chec...

⭐ 0· 55·0 current·0 all-time

by@billwanttobetop

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for billwanttobetop/protein-qc-strict.

Previewing Install & Setup.

Prompt PreviewInstall & Setup

Install the skill "Protein QC Strict" (billwanttobetop/protein-qc-strict) from ClawHub.
Skill page: https://clawhub.ai/billwanttobetop/protein-qc-strict
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Required binaries: cd-hit, mafft, trimal
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install protein-qc-strict

ClawHub CLI

Package manager switcher

npx clawhub@latest install protein-qc-strict

Security Scan

VirusTotal

Benign

View report →

OpenClaw

Suspicious

medium confidence

✓

Purpose & Capability

The declared required binaries (cd-hit, mafft, trimal) match the described workflow (redundancy removal, MSA, trimming). The overall purpose (protein QC for sequence families) aligns with the provided shell script and examples.

Instruction Scope

SKILL.md and README claim a Stage 1 'literature validation' (3365→840) to ensure experimental support for sequences, but the included script does not implement any automated literature or database checks—it assumes input is already validated. The runtime code embeds Python snippets that rely on Biopython and numpy, yet no Python package dependencies are declared. The instructions also recommend 'original tools' and set aggressive cd-hit flags (-M 0, -T 0) that imply using all memory/CPUs; these are operational choices that could impact a host but are coherent with the goal.

ℹ

Install Mechanism

SKILL.md contains an 'install' block that recommends conda/bioconda packages for cd-hit, mafft and trimal (reasonable). However the registry metadata said 'No install spec' which is inconsistent with SKILL.md. There are no downloads from arbitrary URLs. The script and docs do not declare installation of required Python packages (Biopython, numpy), which are necessary to run the embedded Python code.

✓

Credentials

The skill requests no environment variables, credentials, or config paths—appropriate for a purely local QC workflow that operates on user-provided FASTA files.

✓

Persistence & Privilege

The skill is not always-enabled, does not request elevated or persistent privileges, and does not modify other skills or system-wide config. It runs locally and writes output files into the user-specified output directory.

What to consider before installing

This workflow appears to be a legitimate local protein QC pipeline, but review and preparation are required before running it: - Manual literature validation: The documentation emphasizes a literature-validation stage, but the provided script does not perform it. Ensure your input.fasta has already been curated/validated if you require that step. - Install missing Python packages: The script uses embedded Python that requires Biopython and numpy (and possibly other scientific packages). Install them (for example, conda install -c conda-forge biopython numpy) or the inline Python will fail. - Versioning and resource flags: cd-hit is invoked with -M 0 (no memory limit) and -T 0 (use all threads). These can consume all memory/CPUs on the host; adjust -M and -T to match your environment. Ensure MAFFT and trimAl versions are compatible with the commands used. - Input/output behavior: The script cds into the output_dir and writes files there, overwriting files with those names if present. Run in an isolated directory and back up important data. - Data sensitivity: The skill performs local processing only (no network calls in the included files), but your sequence data may be proprietary—do not run on sensitive data in untrusted environments. If you want this to be more robust: ask the author (or modify) to add explicit dependency installation for Python packages, either automated checks at runtime or documented requirements, and clarify whether/how literature validation should be performed (manual vs automated).

Like a lobster shell, security has layers — review code before you run it.

Runtime requirements

Binscd-hit, mafft, trimal

latestvk97a337zxwtefw51t54bpjgrc985hvwk

55downloads

0stars

1versions

Updated 2d ago

v4.0.0

MIT-0

Protein Sequence Analysis Quality Control Skill

Version: 4.0.0
Created: 2026-04-25
Updated: 2026-04-25 22:16
Purpose: Strictest protein sequence analysis quality control - complete workflow (3365 → 456 sequences)

Quick Start

This skill provides a battle-tested quality control workflow for protein sequence analysis, developed through real research on IRED enzyme family (3,365 → 456 sequences).

Key Features:

Literature validation
CD-HIT redundancy removal (90%)
Complexity check (Shannon entropy)
Motif verification (Rossmann fold)
MSA quality assessment
Conservation & coevolution analysis

Use this skill when:

Analyzing protein families
Preparing sequences for phylogenetic analysis
Ensuring publication-quality data
Need to meet strict literature standards

🎯 核心原则（血的教训）

1. 十分严谨，不能猜想 ⭐⭐⭐⭐⭐

用户原话: "我们一定要严谨，十分的严谨，做科研的每一步都不能猜想，每一步都应该做好质检"

2. 必须使用原版工具 ⭐⭐⭐⭐⭐

✅ MAFFT, trimAl, IQ-TREE, CD-HIT, MEME Suite, WebLogo
❌ 不能用 Python 简化实现

3. 每一步都要质检 ⭐⭐⭐⭐⭐

数据准备质检
比对质量质检
分析结果质检
位点位置质检

4. Gap 会严重误导分析 ⭐⭐⭐⭐⭐

必须过滤 gap > 50% 的位点
必须检查每个结果的 gap 比例

📊 完整质检流程（3365 → 456）

阶段 1: 文献追溯（3365 → 840）

目的: 确保所有序列都有实验验证

方法:

检查每条序列的文献来源
确认是否有实验验证
移除无实验验证的序列

标准:

✅ 必须有实验验证
✅ 必须有文献支持

结果:

输入: 3,365 条
输出: 840 条
移除: 2,525 条（75.0%）

质检: ✅ 所有序列都有实验验证

阶段 2: 长度过滤（840 → 793）

目的: 移除异常长度的序列

标准:

最小长度: 200 aa
最大长度: 500 aa
原因: 太短可能是片段，太长可能是融合蛋白

代码:

from Bio import SeqIO

sequences = list(SeqIO.parse("input.fasta", "fasta"))

# 长度过滤
filtered = []
for seq in sequences:
    length = len(seq.seq)
    if 200 <= length <= 500:
        # 检查非标准字符
        seq_str = str(seq.seq)
        bad_chars = set(seq_str) - set('ACDEFGHIKLMNPQRSTVWY')
        if not bad_chars:
            filtered.append(seq)

SeqIO.write(filtered, "filtered.fasta", "fasta")

结果:

输入: 840 条
输出: 793 条
移除: 47 条（5.6%）
- < 200 aa: 43 条
- 非标准字符: 4 条

质检: ✅ 所有序列 200-500 aa，无非标准字符

阶段 3: CD-HIT 去冗余（793 → 456）⭐⭐⭐⭐⭐

目的: 移除高度相似的序列，避免偏倚

工具: CD-HIT v4.8.1（原版，必须！）

阈值: 90%（文献推荐）

命令:

cd-hit -i input.fasta \
       -o output.fasta \
       -c 0.90 \
       -n 5 \
       -M 0 \
       -T 0

参数说明:

-c 0.90: 90% 序列同一性阈值
-n 5: word length（5 for thresholds 0.7-1.0）
-M 0: 无内存限制
-T 0: 使用所有 CPU 线程

结果:

输入: 793 条
输出: 456 条
聚类: 337 个冗余序列（42.5%）

质检:

# 检查聚类文件
grep "^>" output.fasta.clstr | wc -l  # 应该等于输出序列数

# 检查聚类大小分布
grep "^>" output.fasta.clstr -A 1 | grep "at" | \
  awk '{print $NF}' | sort | uniq -c

质检标准:

✅ 去冗余率 30-50% 合理
✅ 大部分聚类大小 1-3

质检结果: ✅ 通过

阶段 4: 复杂度检查（456 → 456）

目的: 移除低复杂度序列（如重复序列）

方法: Shannon 熵

阈值: >= 2.0

公式:

H = -Σ(p_i * log2(p_i))

代码:

from Bio import SeqIO
from collections import Counter
import numpy as np

def calculate_complexity(seq_str):
    if len(seq_str) == 0:
        return 0
    
    counts = Counter(seq_str)
    total = len(seq_str)
    
    entropy = 0
    for count in counts.values():
        if count > 0:
            p = count / total
            entropy -= p * np.log2(p)
    
    return entropy

sequences = list(SeqIO.parse("input.fasta", "fasta"))

filtered = []
low_complexity = []

for seq in sequences:
    complexity = calculate_complexity(str(seq.seq))
    
    if complexity >= 2.0:
        filtered.append(seq)
    else:
        low_complexity.append((seq.id, complexity))

print(f"复杂度 >= 2.0: {len(filtered)}")
print(f"低复杂度: {len(low_complexity)}")

SeqIO.write(filtered, "output.fasta", "fasta")

结果:

输入: 456 条
复杂度 >= 2.0: 456 条
低复杂度: 0 条

质检: ✅ 所有序列复杂度良好

阶段 5: Motif 验证（456 → 456）⭐⭐⭐⭐⭐

目的: 验证序列包含特征性 motif

Motif: Rossmann fold（NAD(P)H 结合）

Pattern: G-X-G-X-X-G

代码:

from Bio import SeqIO
import re

# Rossmann fold pattern
rossmann_pattern = re.compile(r'G.G..G')

sequences = list(SeqIO.parse("input.fasta", "fasta"))

with_motif = []
without_motif = []

for seq in sequences:
    if rossmann_pattern.search(str(seq.seq)):
        with_motif.append(seq)
    else:
        without_motif.append(seq)

print(f"包含 Rossmann fold: {len(with_motif)} ({len(with_motif)/len(sequences)*100:.1f}%)")
print(f"不包含: {len(without_motif)} ({len(without_motif)/len(sequences)*100:.1f}%)")

# 保存所有序列（包括不含 motif 的，可能是变异体）
SeqIO.write(sequences, "output.fasta", "fasta")

结果:

总序列: 456 条
包含 Rossmann fold: 298 条（65.4%）
不包含: 158 条（34.6%）

说明:

不包含 motif 的序列可能是变异体
保留所有序列用于后续分析

质检标准:

✅ 覆盖率 > 50% 良好
✅ 覆盖率 > 70% 优秀

质检结果: ✅ 65.4% 覆盖率良好

阶段 6: 多序列比对⭐⭐⭐⭐⭐

工具: MAFFT v7.490（原版，必须！）

算法: L-INS-i（--localpair，最高精度）

命令:

mafft --localpair \
      --maxiterate 1000 \
      --thread 8 \
      input.fasta 1> aligned.fasta 2> mafft.log

参数说明:

--localpair: 使用局部配对算法（最高精度）
--maxiterate 1000: 最大迭代次数
--thread 8: 使用 8 个线程
1> 和 2>: 分离标准输出和错误输出（重要！）

时间估计:

< 100 序列: 1-2 分钟
100-500 序列: 10-15 分钟
500-1000 序列: 30-60 分钟

质检:

# 检查输出文件
ls -lh aligned.fasta mafft.log

# 检查序列数
grep "^>" aligned.fasta | wc -l

# 检查比对长度
head -2 aligned.fasta | tail -1 | wc -c

阶段 7: 比对修剪

工具: trimAl v1.4（原版，必须！）

方法: -automated1（自动选择最佳策略）

命令:

trimal -in aligned.fasta \
       -out trimmed.fasta \
       -automated1

说明:

自动移除高 gap 区域
保留保守区域
优化系统发育分析

质检:

# 检查修剪前后
echo "修剪前:"
grep "^>" aligned.fasta | wc -l
head -2 aligned.fasta | tail -1 | wc -c

echo "修剪后:"
grep "^>" trimmed.fasta | wc -l
head -2 trimmed.fasta | tail -1 | wc -c

阶段 8: 比对质量评估⭐⭐⭐⭐⭐

必须评估的指标:

8.1 Gap 比例

标准:

< 20%: 优秀
20-30%: 良好
30-40%: 可接受
40%: 需要改进

代码:

from Bio import AlignIO
import numpy as np

alignment = AlignIO.read("trimmed.fasta", "fasta")

gap_ratios = []
for i in range(alignment.get_alignment_length()):
    col = alignment[:, i]
    gap_ratio = col.count('-') / len(alignment)
    gap_ratios.append(gap_ratio)

mean_gap = np.mean(gap_ratios) * 100
print(f"平均 Gap: {mean_gap:.1f}%")

if mean_gap < 20:
    print("✅ 优秀")
elif mean_gap < 30:
    print("✅ 良好")
elif mean_gap < 40:
    print("⚠️ 可接受")
else:
    print("❌ 需要改进")

8.2 序列同一性

标准:

40-60%: 最佳
30-70%: 理想
< 30% 或 > 70%: 偏离理想

代码:

import random
random.seed(42)

def calculate_identity(seq1, seq2):
    matches = 0
    total = 0
    for aa1, aa2 in zip(seq1, seq2):
        if aa1 != '-' and aa2 != '-':
            total += 1
            if aa1 == aa2:
                matches += 1
    return matches / total if total > 0 else 0

# 随机采样 100 对
identities = []
for _ in range(100):
    i, j = random.sample(range(len(alignment)), 2)
    identity = calculate_identity(
        str(alignment[i].seq),
        str(alignment[j].seq)
    )
    identities.append(identity)

mean_identity = np.mean(identities) * 100
print(f"平均同一性: {mean_identity:.1f}%")

if 40 <= mean_identity <= 60:
    print("✅ 最佳")
elif 30 <= mean_identity <= 70:
    print("✅ 理想")
else:
    print("⚠️ 偏离理想")

8.3 覆盖度

标准:

85%: 优秀
80-85%: 良好
< 80%: 需要改进

代码:

coverages = []
for record in alignment:
    seq = str(record.seq)
    non_gap = len([aa for aa in seq if aa != '-'])
    coverage = non_gap / len(seq)
    coverages.append(coverage)

mean_coverage = np.mean(coverages) * 100
print(f"平均覆盖度: {mean_coverage:.1f}%")

if mean_coverage > 85:
    print("✅ 优秀")
elif mean_coverage > 80:
    print("✅ 良好")
else:
    print("⚠️ 需要改进")

阶段 9: 保守性分析

方法: Shannon 熵（归一化）

公式:

H = -Σ(p_i * log2(p_i))
H_norm = H / log2(20)  # 归一化到 0-1

保守标准:

H_norm < 0.3: 高度保守
H_norm 0.3-0.6: 中等保守
H_norm > 0.6: 可变

代码:

from collections import Counter

entropies = []
for i in range(alignment.get_alignment_length()):
    column = alignment[:, i]
    column_no_gap = [aa for aa in column if aa != '-']
    
    if len(column_no_gap) == 0:
        continue
    
    counts = Counter(column_no_gap)
    total = len(column_no_gap)
    
    entropy = 0
    for count in counts.values():
        if count > 0:
            p = count / total
            entropy -= p * np.log2(p)
    
    # 归一化
    max_entropy = np.log2(20)
    norm_entropy = entropy / max_entropy
    entropies.append(norm_entropy)

# 识别保守位点
highly_conserved = [i for i, e in enumerate(entropies) if e < 0.3]
print(f"高度保守位点: {len(highly_conserved)}")

质检: 检查保守位点的 gap 比例

for pos in highly_conserved:
    gap_ratio = gap_ratios[pos]
    if gap_ratio > 0.5:
        print(f"⚠️ 保守位点 {pos+1} 在高 gap 区域 ({gap_ratio*100:.1f}%)")

阶段 10: 共进化分析⭐⭐⭐⭐⭐

方法: 归一化互信息（NMI）

公式:

MI(X,Y) = H(X) + H(Y) - H(X,Y)
NMI = 2 * MI / (H(X) + H(Y))

过滤标准:

只分析 gap < 50% 的位点
最小配对数 >= 50
距离 > 5

代码:

from itertools import combinations

# 1. 过滤高 gap 位点
valid_positions = []
for i in range(alignment.get_alignment_length()):
    if gap_ratios[i] <= 0.5:
        valid_positions.append(i)

print(f"有效位点: {len(valid_positions)} / {alignment.get_alignment_length()}")

# 2. 计算互信息
def calculate_mutual_information(alignment, pos1, pos2):
    col1 = alignment[:, pos1]
    col2 = alignment[:, pos2]
    
    # 移除含 gap 的序列
    pairs = [(aa1, aa2) for aa1, aa2 in zip(col1, col2) 
             if aa1 != '-' and aa2 != '-']
    
    if len(pairs) < 50:
        return 0
    
    # 计算熵
    col1_clean = [p[0] for p in pairs]
    col2_clean = [p[1] for p in pairs]
    
    H1 = calculate_entropy(col1_clean)
    H2 = calculate_entropy(col2_clean)
    
    # 联合熵
    pair_counts = Counter(pairs)
    total = len(pairs)
    H12 = 0
    for count in pair_counts.values():
        if count > 0:
            p = count / total
            H12 -= p * np.log2(p)
    
    # 归一化互信息
    MI = H1 + H2 - H12
    NMI = 2 * MI / (H1 + H2) if (H1 + H2) > 0 else 0
    
    return NMI

# 3. 计算所有位点对
mi_scores = []
for i, j in combinations(valid_positions, 2):
    if abs(i - j) > 5:
        mi = calculate_mutual_information(alignment, i, j)
        if mi > 0:
            mi_scores.append({
                'pos1': i + 1,
                'pos2': j + 1,
                'MI': mi,
                'gap1': gap_ratios[i],
                'gap2': gap_ratios[j]
            })

# 4. 排序
mi_scores.sort(key=lambda x: x['MI'], reverse=True)

质检: 检查 Top 10

for pair in mi_scores[:10]:
    # 检查 gap
    if pair['gap1'] > 0.5 or pair['gap2'] > 0.5:
        print(f"❌ 位点对 {pair['pos1']}-{pair['pos2']} 包含高 gap 位点")
    
    # 检查位置
    if pair['pos1'] < 20 or pair['pos2'] < 20:
        print(f"⚠️ 位点对 {pair['pos1']}-{pair['pos2']} 包含序列开头位点")

📚 文献标准对比

我们的结果 vs 文献标准

指标	文献标准	我们的结果	状态
Gap 比例	< 30%	预期 < 20%	✅
序列同一性	40-60%	预期 45-55%	✅
覆盖度	> 80%	预期 > 85%	✅
去冗余	90-95%	90%	✅
复杂度	熵 >= 2.0	全部通过	✅
Motif 覆盖	> 50%	65.4%	✅

🛠️ 必须使用的原版工具

1. CD-HIT

conda install -c bioconda cd-hit

2. MAFFT

conda install -c bioconda mafft

3. trimAl

conda install -c bioconda trimal

4. IQ-TREE

conda install -c bioconda iqtree

5. MEME Suite

conda install -c bioconda meme

6. WebLogo

conda install -c bioconda weblogo

⚠️ 常见错误（避免！）

1. 不做 CD-HIT 去冗余

❌ 导致序列偏倚
❌ 序列同一性偏低

2. 不检查复杂度

❌ 低复杂度序列误导分析

3. 不验证 Motif

❌ 可能包含错误的序列

4. MAFFT 输出被污染

❌ 不分离标准输出和错误输出
✅ 使用 1> output.fasta 2> log.txt

5. 不过滤高 gap 位点

❌ 共进化分析出现假阳性

6. 不检查位点位置

❌ 序列开头的 hub 可能不是功能中心

✅ 完整质检清单

数据准备

文献追溯完成
实验验证确认
长度过滤（200-500 aa）
非标准字符移除

严格质检

CD-HIT 去冗余（90%）
复杂度检查（熵 >= 2.0）
Motif 验证（覆盖率 > 50%）

比对质量

保守性分析

Shannon 熵归一化
保守位点 gap < 50%

共进化分析

过滤 gap > 50% 的位点
最小配对数 >= 50
Top 10 无高 gap 对
Hub 位点位置检查

系统发育分析

IQ-TREE ModelFinder
Bootstrap >= 1000
支持度 > 70%

🎓 经验总结

用户教导的核心原则

十分严谨，不能猜想
每一步都要质检
必须使用原版工具
Gap 会严重误导分析
要质疑不合理的结果

完整流程

3,365 条原始序列
    ↓ [文献追溯]
840 条实验验证
    ↓ [长度过滤]
793 条高质量
    ↓ [CD-HIT 90%]
456 条非冗余
    ↓ [复杂度检查]
456 条（全部通过）
    ↓ [Motif 验证]
456 条（65.4% 覆盖）
    ↓ [MAFFT --localpair]
    ↓ [trimAl -automated1]
    ↓ [质量评估]
    ↓ [所有分析]

Skill 版本: 4.0.0
最后更新: 2026-04-25 22:16
状态: 完整的 3365 → 456 流程
质量: 预期 100% 文献标准

这是最严格的质检流程！ 🎯

Comments

Loading comments...