https://www.youtube.com/watch?v=TdAA1MFSjog&list=PLaE61CK5r6_l2fxVp3r3OP0fgTSTdQUoQ ================================================================================ BLAST: Basic Local Alignment Search Tool Biologically homology search for protein and necleotide - blastn: for nucleotide - blastp (blastx): for protein (6 possible translations) ================================================================================ ================================================================================ ================================================================================ Result from BLAST Length of query (protein or nucleotide) sequence: around 270 bp - HSP: high score position ================================================================================ Install BLAST sudo apt-get install ncbi-blast+ # Download data file wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/swissprot.gz # Uncompress data file gunzip swissprot.gz # Create data format which can be consumed in BLAST makeblastdb -in swissprot -dbtype prot # outfmt: output format is 7th one # query.fasta: input file # query.blout: output file blastx -db swissprot -outfmt 7 < query.fasta > query.blout ================================================================================ Formatting options - 0: traditional format, showing all alignments - 5,6,7: much used ================================================================================ Parse output file import sys from Bio.Blast import NCBIStandalone blast_parser=NCBIStandalone.BlastParser() for record in NCBIStandalone.Iterator(sys.stdin,blast_parser): for alignment in record.alignments: for hsp in alignment.hsps: print(alignment.title) print(alignment.length) print(alignment.expect) print(alignment.query) print(alignment.match) print(alignment.sbjct) ================================================================================ Result from BLAST ================================================================================ import sys from Bio.Blast import NCBIXML for record in NCBIXML.parse(sys.stdin): for alignment in record.alignments: for hsp in alignment.hsps: print(alignment.title) print(alignment.length) print(alignment.expect) print(alignment.query) print(alignment.match) print(alignment.sbjct) ================================================================================ Quiz - You obtain result from BLAST - You insert that result into pipeline - You shuld get summrization of BLAST result - Contraint: expected_value<0.005 - Run example python blast_summary.py 0.005 < my_quesry.blout.pairwise - Summarization example: ================================================================================ Subprocess Subprocess Python module allows you to spawn new processes, you to connect to their input/output/error pipes, and you can obtain their return codes from subprocess import Popen,PIPE # Create new process, run command, prepare "pipe" to store result text p=Popen(['ls','-la'],stdout=PIPE) # Get result string from process output=p.communicate()[0] ================================================================================ ps -ef | grep zsh result of "ps -ef" becomes "input" to "grep zsh" ================================================================================ p1=Popen(['ls','-al'],stdout=PIPE) p2=Popen(['grep','zsh'],stdin=p1.stdout,stdout=PIPE) p1.stdout.close() output=p2.communicate()[0] ================================================================================ from Bio import SeqIO from subprocess import Popen,PIPE my_seq_record=SeqIO.parse("query.fasta","fasta") input_query=my_seq_record.format("fasta") p=Popen(["blastx","-db","swissprot","-outfmt","6"],stdin=input_query) output=p.communicate()[0] number_of_hsps=len(output.splitlines()) ================================================================================ Quiz - You can search data in nucleotide_NCBI by using "specific given keyword from user" - You obtain 10 number of data - You search above data from swissprot_DB by using blastx - You should answer how many number of proteins are - Constraint: E-value < $$$10^{-5}$$$, duplicated number of protein is not allowed - Run example: python count-concerned-proteins.py "apoptosis" # 356 ================================================================================ Implementation vi search.py import sys from Bio import Entrez from Bio import SeqIO from subprocess import Popen,PIPE from Bio.Blast import NCBIStandalone StringIO ================================================================================ keyword=sys.argv[1] handle=Entrez.esearch(db="nucleotide",term=keyword) result=Entrez.read(handle) idlist=result["IdList"] ================================================================================ blast_parser=NCBIStandalone.BlastParser() ================================================================================ handle=Entrez.efetch(db="nucleotide",id=idlist,rettype="gb",retmode="text") for record in SeqIO.parse(hanlde,"gb"): # print(record.format("fasta")) input=record.format("fasta") p=Popen(['blastx','-db','swissprot','-outfmt','0'],stdin=input) output=p.communicate()[0] ================================================================================ total=0 for result in NCBIStandalone.Iterator(StringIO(output),blast_parser): for alignment in result.alignments: first_hsp=alignmentts.hsps[0] if first_hsp<0.0005: total+=len(result.alignments) ================================================================================ # Run python search.py apoptosis