Install
openclaw skills install protein-sequence-qc-pro
Professional protein sequence quality control and visualization workflow. Includes a complete QC pipeline (length filter, CD-HIT, complexity check, motif verification, MSA, trimming), conservation/coevolution analysis, and Nature-style publication-ready figures. Based on multi-source IRED dataset analysis (3,365 → 1,531 sequences).
Version: 5.0.0
Created: 2026-05-08
Purpose: Professional protein sequence QC with publication-ready figures
This skill provides a complete, battle-tested quality control workflow for protein sequence analysis, with automatic generation of Nature-style publication-ready figures.
Key Features:
Use this skill when:
Raw sequences (3,365)
↓ [Length filter: 200-500 aa]
2,963 sequences (88.1%)
↓ [CD-HIT 90% redundancy removal]
1,531 sequences (45.5%)
↓ [Complexity check: entropy ≥ 2.0]
1,531 sequences (100%)
↓ [Motif verification: Rossmann fold]
1,531 sequences (67.7% coverage)
↓ [MAFFT alignment: --localpair]
1,928 columns
↓ [trimAl: -automated1]
164 columns (8.5%)
↓ [Quality assessment]
↓ [Conservation analysis: 8 sites]
↓ [Coevolution analysis: Top 50 pairs]
↓ [Generate 12+ figures]
✅ Publication-ready dataset
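The first and third steps of the flow above (length window, then complexity cutoff) can be sketched in plain Python. This is a minimal illustration, not the code in `scripts/run_complete_qc.py`: the helper names `read_fasta`, `sequence_entropy`, and `prefilter` are hypothetical, and it assumes the complexity check means the Shannon entropy of each sequence's residue composition, in bits.

```python
import math
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sequence_entropy(seq):
    """Shannon entropy (bits) of the residue composition of one sequence.

    Assumption: this is what the 'entropy >= 2.0' complexity check measures.
    """
    counts = Counter(seq)
    total = len(seq)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def prefilter(records, min_length=200, max_length=500, min_entropy=2.0):
    """Keep records inside the length window that also pass the entropy cutoff."""
    kept = []
    for header, seq in records:
        if not (min_length <= len(seq) <= max_length):
            continue  # outside the 200-500 aa window
        if sequence_entropy(seq) < min_entropy:
            continue  # low-complexity sequence
        kept.append((header, seq))
    return kept
```

The defaults mirror the thresholds shown in the flow; redundancy removal (CD-HIT) sits between these two steps in the real pipeline and is not reproduced here.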
# Run complete QC pipeline
python3 scripts/run_complete_qc.py \
--input raw_sequences.fasta \
--output qc_results/ \
--threads 8
# Generate all figures
python3 scripts/generate_all_figures.py \
--analysis qc_results/analysis/ \
--output figures/
# Custom QC parameters
python3 scripts/run_complete_qc.py \
--input raw_sequences.fasta \
--output qc_results/ \
--min-length 200 \
--max-length 500 \
--cdhit-threshold 0.90 \
--complexity-threshold 2.0 \
--threads 8
# Generate Nature-style figures only
python3 scripts/generate_nature_figures.py \
--analysis qc_results/analysis/ \
--output figures/nature/
All figures follow Nature journal standards:
# Generate Nature-style conservation landscape
python3 scripts/generate_nature_conservation_landscape.py \
--analysis qc_results/analysis/ \
--output figures/
Output:
figure_nature_01_conservation_landscape.png (300 DPI)
figure_nature_01_conservation_landscape.pdf (vector)
Figure panels:
| Metric | Excellent | Good | Acceptable | Poor |
|---|---|---|---|---|
| Gap ratio | < 20% | 20-30% | 30-40% | > 40% |
| Sequence identity | 40-60% | 30-70% | 20-80% | < 20% or > 80% |
| Coverage | > 85% | 80-85% | 75-80% | < 75% |
| Conserved sites | > 10 | 5-10 | 3-5 | < 3 |
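The thresholds in the table can be turned into small grading helpers. The sketch below is one possible reading: the function names are hypothetical, ratios are passed as fractions in [0, 1], and values that fall exactly on a shared boundary (for example 5 conserved sites, or a 30% gap ratio) are assigned to the better grade, which is an assumption the table does not settle.

```python
def grade_gap_ratio(gap_ratio):
    """Grade the alignment gap ratio (fraction, lower is better)."""
    if gap_ratio < 0.20:
        return "Excellent"
    if gap_ratio <= 0.30:
        return "Good"
    if gap_ratio <= 0.40:
        return "Acceptable"
    return "Poor"


def grade_identity(identity):
    """Grade mean sequence identity (fraction); the bands are nested ranges."""
    if 0.40 <= identity <= 0.60:
        return "Excellent"
    if 0.30 <= identity <= 0.70:
        return "Good"
    if 0.20 <= identity <= 0.80:
        return "Acceptable"
    return "Poor"


def grade_coverage(coverage):
    """Grade alignment coverage (fraction, higher is better)."""
    if coverage > 0.85:
        return "Excellent"
    if coverage >= 0.80:
        return "Good"
    if coverage >= 0.75:
        return "Acceptable"
    return "Poor"


def grade_conserved_sites(count):
    """Grade the number of conserved sites (higher is better)."""
    if count > 10:
        return "Excellent"
    if count >= 5:
        return "Good"
    if count >= 3:
        return "Acceptable"
    return "Poor"
```

Under this reading, the 8 conserved sites reported in the pipeline summary grade as Good.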
Formula:
H = -Σ(p_i * log2(p_i))
H_norm = H / log2(20)
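As a worked illustration of the two formulas, the sketch below computes H and H_norm for a single alignment column. The function name `column_entropy` is hypothetical, and excluding gap characters before computing frequencies is an assumption; the skill's own scripts may treat gaps differently.

```python
import math
from collections import Counter

GAP_CHARS = "-."  # assumption: both characters denote alignment gaps


def column_entropy(column):
    """Return (H, H_norm) for one alignment column given as a string.

    H is the Shannon entropy in bits over non-gap residues;
    H_norm divides by log2(20), the maximum for 20 amino acids.
    """
    residues = [c for c in column.upper() if c not in GAP_CHARS]
    if not residues:
        return 0.0, 0.0  # all-gap column: entropy is undefined, report 0
    counts = Counter(residues)
    total = len(residues)
    h = sum(-(n / total) * math.log2(n / total) for n in counts.values())
    return h, h / math.log2(20)
```

A fully conserved column gives H = 0, and a column with all 20 amino acids at equal frequency gives H_norm = 1.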
Classification:
Important: Always check the gap ratio of conserved sites! A column that is almost entirely gaps can show near-zero entropy without being truly conserved.
# Check conserved sites quality
for site in conserved_sites:
    if site['gap_ratio'] > 0.5:
        print(f"⚠️ Site {site['position']} has a high gap ratio ({site['gap_ratio']:.1%})")
High-quality conserved sites:
Formula:
MI(X,Y) = H(X) + H(Y) - H(X,Y)
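The formula can be evaluated directly from two alignment columns. The sketch below is a minimal version: `entropy` and `mutual_information` are hypothetical helper names, and dropping any row where either column has a gap is an assumption, not necessarily what the skill's scripts do.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy in bits of a list of hashable symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def mutual_information(col_x, col_y, gap_chars="-."):
    """MI(X,Y) = H(X) + H(Y) - H(X,Y) for two equal-length columns."""
    if len(col_x) != len(col_y):
        raise ValueError("columns must have the same number of rows")
    # Assumption: rows with a gap in either column are ignored.
    pairs = [(x, y) for x, y in zip(col_x, col_y)
             if x not in gap_chars and y not in gap_chars]
    if not pairs:
        return 0.0
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return entropy(xs) + entropy(ys) - entropy(pairs)
```

Two columns that vary in lockstep (such as `"AACC"` and `"GGTT"`) share all their information, while columns that vary independently give an MI of zero.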
Filtering criteria:
High MI (> 1.0):
Example from IRED analysis:
Conclusion: Position 63 is a hub → likely catalytic center
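One simple way to spot a hub like this is to count how often each position appears among the top coevolving pairs. The sketch below assumes the pairs are available as `(position_i, position_j, mi)` tuples; the function name `hub_positions` and the `min_degree` cutoff are hypothetical, and the column layout of `coevolution_top50.csv` is not assumed here.

```python
from collections import Counter


def hub_positions(pairs, min_degree=3):
    """Return (position, degree) for positions in at least min_degree pairs.

    pairs: iterable of (position_i, position_j, mi) tuples.
    Results are ordered from highest to lowest degree.
    """
    degree = Counter()
    for pos_i, pos_j, _mi in pairs:
        degree[pos_i] += 1
        degree[pos_j] += 1
    return [(pos, n) for pos, n in degree.most_common() if n >= min_degree]
```

A position with a high degree is only a candidate; whether it is catalytic still needs structural or experimental support.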
# Example: Map to IR08 enzyme
python3 scripts/map_conserved_sites.py \
--reference qc_results/analysis/ \
--query IR08.fasta \
--output IR08_mapping.json
# Generate figures
python3 scripts/generate_enzyme_figures.py \
--mapping IR08_mapping.json \
--output figures/IR08/
Output figures:
qc_results/
├── sequences/
│ ├── 01_length_filtered.fasta
│ ├── 02_cdhit_90.fasta
│ ├── 03_complexity_checked.fasta
│ └── 04_motif_checked.fasta
├── alignment/
│ ├── 05_aligned.fasta
│ └── 06_trimmed.fasta
├── analysis/
│ ├── alignment_analysis.json
│ ├── gap_ratios.json
│ ├── highly_conserved_positions.txt
│ ├── coevolution_analysis.json
│ └── coevolution_top50.csv
├── logs/
│ ├── qc_analysis_YYYYMMDD_HHMMSS.log
│ └── mafft.log
└── figures/
├── qc_pipeline.png
├── conservation_quality.png
├── coevolution_network.png
├── figure_nature_01_conservation_landscape.png
├── figure_nature_01_conservation_landscape.pdf
└── ... (12+ figures)
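Before generating figures, it can help to confirm that a run produced the files listed above. The sketch below checks only the sequence, alignment, and analysis files named in the tree; the function name `missing_outputs` is hypothetical, and logs and figures are left out because their names are timestamped or only partly listed.

```python
from pathlib import Path

# Relative paths taken from the output tree above.
EXPECTED_OUTPUTS = [
    "sequences/01_length_filtered.fasta",
    "sequences/02_cdhit_90.fasta",
    "sequences/03_complexity_checked.fasta",
    "sequences/04_motif_checked.fasta",
    "alignment/05_aligned.fasta",
    "alignment/06_trimmed.fasta",
    "analysis/alignment_analysis.json",
    "analysis/gap_ratios.json",
    "analysis/highly_conserved_positions.txt",
    "analysis/coevolution_analysis.json",
    "analysis/coevolution_top50.csv",
]


def missing_outputs(root):
    """Return the expected relative paths that are not files under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

For example, `missing_outputs("qc_results")` returns an empty list when every listed file is present.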
Always check gap ratio for conserved sites!
❌ Bad example:
Position 5: Gap 99.9%, Entropy 0.000
→ This is NOT a real conserved site!
✅ Good example:
Position 8: Gap 2.2%, Entropy 0.012
→ This is a high-quality conserved site!
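The bad example arises because entropy computed over the few non-gap residues of a nearly empty column is trivially zero. A minimal sketch of computing per-column gap ratios from an alignment follows; the helper names are hypothetical, and treating both `-` and `.` as gap characters is an assumption.

```python
def column_gap_ratio(column, gap_chars="-."):
    """Fraction of rows in one alignment column that are gaps."""
    if not column:
        raise ValueError("empty column")
    gaps = sum(1 for c in column if c in gap_chars)
    return gaps / len(column)


def alignment_columns(sequences):
    """Yield each alignment column as a string; rows must share one length."""
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    for chars in zip(*sequences):
        yield "".join(chars)


def flag_gappy_columns(sequences, max_gap_ratio=0.5):
    """Return (1-based position, gap ratio) for columns above max_gap_ratio."""
    flagged = []
    for position, column in enumerate(alignment_columns(sequences), start=1):
        ratio = column_gap_ratio(column)
        if ratio > max_gap_ratio:
            flagged.append((position, ratio))
    return flagged
```

A column with 999 gaps and a single residue has a gap ratio of 99.9%, and its one remaining residue yields zero entropy regardless of how variable the site really is.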
Required:
Why: These tools are battle-tested and widely accepted in publications.
# ✅ Correct
mafft --localpair input.fasta 1> output.fasta 2> mafft.log
# ❌ Wrong (stderr merged into stdout, so log messages contaminate the alignment)
mafft --localpair input.fasta > output.fasta 2>&1
Symptom: Alignment file contains log messages
Solution:
mafft --localpair input.fasta 1> output.fasta 2> mafft.log
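When MAFFT is launched from Python rather than a shell, the same separation applies: stdout and stderr should go to different files. A minimal sketch using the standard `subprocess` module follows; the wrapper name `run_with_split_streams` is hypothetical, and it is not taken from the skill's scripts.

```python
import subprocess


def run_with_split_streams(cmd, out_path, log_path):
    """Run cmd, writing stdout to out_path and stderr to log_path.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    with open(out_path, "w") as out, open(log_path, "w") as log:
        subprocess.run(cmd, stdout=out, stderr=log, check=True)
```

For the command above, this would be called as `run_with_split_streams(["mafft", "--localpair", "input.fasta"], "output.fasta", "mafft.log")`. Passing `stderr=subprocess.STDOUT` instead would reproduce the contamination described in the symptom.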
Symptom: Conserved sites have a gap ratio > 50%
Solution: These are NOT real conserved sites. Filter them out:
high_quality_sites = [s for s in conserved_sites if s['gap_ratio'] < 0.1]
Symptom: Average identity < 20%
Interpretation: This is normal for highly diverse protein families. Not a problem if:
Solution: Use the dedicated Nature-style script:
python3 scripts/generate_nature_conservation_landscape.py
Skill version: 5.0.0
Last updated: 2026-05-08
Status: Production-ready
Quality: Publication-grade
Based on real research:
This skill provides:
Perfect for:
Start using:
python3 scripts/run_complete_qc.py --input your_sequences.fasta --output results/