# BLAST Sequence Similarity Search Search for similar sequences in NCBI databases using BLAST (Basic Local Alignment Search Tool) via OpenBio API. ## When to Use Use BLAST tools when: 1. Identifying an unknown protein or nucleotide sequence 2. Finding homologous sequences across species 3. Checking if a designed sequence has known relatives 4. Annotating a sequence by finding similar characterized sequences 5. Validating sequence identity against a reference database ## Decision Tree ``` What type of sequence do you have and what are you searching? │ ├─ Protein sequence → search protein DB? │ └─ blastp + nr/swissprot/pdb/refseq_protein │ ├─ Nucleotide sequence → search nucleotide DB? │ └─ blastn + nt/refseq_rna/refseq_genomic │ ├─ Nucleotide sequence → search protein DB? │ └─ blastx + nr/swissprot/pdb/refseq_protein │ (translates your nucleotide query in all 6 frames) │ ├─ Protein sequence → search nucleotide DB? │ └─ tblastn + nt/refseq_rna/refseq_genomic │ (translates DB sequences in all 6 frames) │ └─ Nucleotide vs translated nucleotide? └─ tblastx + nt/refseq_rna/refseq_genomic (translates both query and DB) ``` ### Database Selection | Database | Type | Contents | Best For | |----------|------|----------|----------| | nr | Protein | Non-redundant protein sequences | Broadest protein search | | swissprot | Protein | Curated UniProt entries | High-quality annotated proteins | | pdb | Protein | Protein Data Bank sequences | Finding structural homologs | | refseq_protein | Protein | NCBI RefSeq proteins | Reference protein sequences | | nt | Nucleotide | Non-redundant nucleotide | Broadest nucleotide search | | refseq_rna | Nucleotide | NCBI RefSeq RNA | Reference transcripts | | refseq_genomic | Nucleotide | NCBI RefSeq genomes | Reference genomic sequences | **Important**: Database must be compatible with the BLAST program. Protein programs (blastp, blastx) require protein databases. Nucleotide programs (blastn, tblastx) require nucleotide databases. tblastn uses nucleotide databases but takes a protein query. ## Tools Reference ### submit_blast — Submit a BLAST search Submits a sequence to NCBI BLAST and returns a Request ID (RID) for polling. ```bash # Protein BLAST against SwissProt curl -X POST "https://api.openbio.tech/api/v1/tools" \ -H "X-API-Key: $OPENBIO_API_KEY" \ -F "tool_name=submit_blast" \ -F 'params={"query": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH", "program": "blastp", "database": "swissprot", "evalue": 0.001, "max_hits": 10}' ``` ```bash # Nucleotide BLAST against nt curl -X POST "https://api.openbio.tech/api/v1/tools" \ -H "X-API-Key: $OPENBIO_API_KEY" \ -F "tool_name=submit_blast" \ -F 'params={"query": "ATGGTTCTGTCTAAGCCCGATGACAAAACCAACGTGAAAGCAGCCTGGGGA", "program": "blastn", "database": "nt"}' ``` **Parameters**: | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | query | string | Yes | — | Raw sequence (10–10,000 characters, no FASTA header) | | program | string | Yes | — | blastn, blastp, blastx, tblastn, or tblastx | | database | string | No | nr | Target database (must match program type) | | evalue | float | No | 10.0 | E-value threshold (lower = more stringent) | | max_hits | int | No | 10 | Max hits to return (1–50) | **Returns**: `rid` (Request ID), `program`, `database`, `query_len` ### check_blast_status — Poll for completion BLAST searches take 10–60+ seconds. Poll this endpoint until status is `READY`. ```bash curl -X POST "https://api.openbio.tech/api/v1/tools" \ -H "X-API-Key: $OPENBIO_API_KEY" \ -F "tool_name=check_blast_status" \ -F 'params={"rid": "YOUR_RID_HERE"}' ``` **Parameters**: | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | rid | string | Yes | Request ID from submit_blast | **Returns**: `status` — one of: - `WAITING` — still running, poll again in 10–15 seconds - `READY` — results available, call get_blast_results - `FAILED` — search failed, resubmit - `UNKNOWN` — unexpected state ### get_blast_results — Retrieve results Fetch parsed results once status is `READY`. ```bash curl -X POST "https://api.openbio.tech/api/v1/tools" \ -H "X-API-Key: $OPENBIO_API_KEY" \ -F "tool_name=get_blast_results" \ -F 'params={"rid": "YOUR_RID_HERE", "max_hits": 20}' ``` **Parameters**: | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | rid | string | Yes | — | Request ID (must have READY status) | | max_hits | int | No | 10 | Max hits to return (1–50) | **Returns**: `query_title`, `query_len`, `program`, `database`, `total_hits`, `hits` array. Each hit includes: - `accession`, `title`, `scientific_name`, `taxid` - `evalue`, `bit_score`, `identity_pct`, `align_len` - `query_start`, `query_end`, `hit_start`, `hit_end`, `gaps` ## Quality Thresholds ### E-value Interpretation | E-value | Interpretation | |---------|----------------| | < 1e-50 | Excellent match — near-identical or close homolog | | 1e-50 to 1e-10 | Strong match — likely homologous | | 1e-10 to 0.01 | Moderate — possible remote homology | | 0.01 to 10 | Weak — may be spurious, inspect alignment | | > 10 | Not significant — likely random | ### Identity Percentage | Identity % | Interpretation | |------------|----------------| | > 90% | Very high — same protein/gene or very close ortholog | | 70–90% | High — clear homolog, likely conserved function | | 30–70% | Moderate — probable homolog, function may diverge | | < 30% | Low — remote homology or convergent similarity | ### Bit Score Higher bit scores indicate better alignments. Bit scores > 50 are generally significant. Unlike e-values, bit scores are independent of database size. ## Common Workflow ``` 1. Submit BLAST search → submit_blast with sequence, program, database → Save the returned RID 2. Poll for completion (wait 10-15 seconds between checks) → check_blast_status with RID → Repeat until status is READY 3. Retrieve results → get_blast_results with RID → Parse hits by e-value, identity, bit score 4. Analyze top hits → Check species, annotations, and alignment quality → Use other OpenBio tools for follow-up: - lookup_gene for gene details - fetch_pdb_metadata for structural info - get_sequence for full sequence retrieval ``` ## Common Mistakes ### Wrong: Mismatched program and database ``` ❌ submit_blast with program="blastp" and database="nt" → Protein program cannot search nucleotide database ``` ``` ✅ Match program type to database type: - blastp/blastx → protein databases (nr, swissprot, pdb) - blastn/tblastx → nucleotide databases (nt, refseq_rna) - tblastn → nucleotide databases (protein query, translated DB) ``` ### Wrong: Not waiting for completion ``` ❌ Calling get_blast_results immediately after submit_blast → Search hasn't finished yet ``` ``` ✅ Always check status first: → check_blast_status until READY → Then get_blast_results ``` ### Wrong: Using too permissive e-value ``` ❌ submit_blast with evalue=10 for identifying a protein → Returns many spurious hits ``` ``` ✅ Use stringent e-value for identification: → evalue=0.001 or lower for confident matches → Only use evalue=10 for exhaustive remote homology searches ``` ## Troubleshooting | Issue | Cause | Fix | |-------|-------|-----| | Status stays WAITING | Large database or busy server | Wait longer (up to 5 min), poll every 15 seconds | | Status FAILED | Invalid sequence or server error | Verify sequence is valid, resubmit | | No hits returned | Sequence too short or novel | Try broader database (nr/nt), increase e-value | | Too many low-quality hits | Permissive e-value | Lower e-value threshold (e.g., 0.001) | | Program/database error | Incompatible combination | See database selection table above | | Sequence rejected | Too short (< 10) or too long (> 10,000) | Trim or split sequence to 10–10,000 characters | --- **Tip**: For quick protein identification, use `blastp` against `swissprot` with `evalue=0.001`. SwissProt is smaller but curated, giving faster and more annotated results than `nr`.