http://e.biohackers.net/%ED%8C%8C%EC%9D%B4%EC%8D%AC%EC%9C%BC%EB%A1%9C_%EB%A9%94%ED%83%80%EC%A7%80%EB%86%88_%EA%B8%B0%EB%B3%B8_%ED%94%84%EB%A1%9C%ED%8C%8C%EC%9D%BC%EB%A7%81_%EB%94%B0%EB%9D%BC%ED%95%98%EA%B8%B0 ================================================================================ - Suppose you have megenomics NGS data - And you want to do profiling (annotation) on that NGS data into Pandas data form ================================================================================ FASTA=convert(FASTQ_data) output=BLAST(FASTA) output includes "plant spicies identify number" and "plant sequence data" ================================================================================ NCBI SRA Toolkit trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software In the case of me, following directory contains fastq-dump bin file /home/young/Downloads/sratoolkit.2.9.6-1-ubuntu64/bin # Create new directory cd NGS_test/ # Run following to download NGS data /home/young/Downloads/sratoolkit.2.9.6-1-ubuntu64/bin/fastq-dump \ SRR041657 \ --split-files \ --fasta 60 - Used data: SRX019702 (Metagenomics of human microbiome) which had been opened in 2010 - SRX019702 data is composed of 4 number of "run data", SRR04165~7 - Meaning of above command: download run data SRR041657 in FASTA format - split-files: separate files per "pair" So, if there are 10 number of pairs, you will get 10 number of files ================================================================================ Install BLAST sudo apt-get install ncbi-blast+ # Reference sequence database: swiss prot mkdir -p /NGS_test/swiss_prot wget ftp://ftp.ncbi.nlm.nih.gov/blast/db//swissprot.tar.gz tar zxvf swissprot.tar.gz ================================================================================ Run BLAST blastx \ -db /home/young/NGS_test/swiss_prot_db/swissprot \ -max_target_seqs 1 \ -evalue 0.001 \ -outfmt '6 std sgi staxids' \ < /home/young/NGS_test/SRR041657_1.fasta \ > /home/young/NGS_test/result/SRR041657_1.blout -max_target_seqs 1: you use only best match using only best match makes sense because you're doing profiling -outfmt '6 std sgi staxids: output format, print following data at the same time, "6th basic std", "GenBank's ID gi", "NCBI Taxon ID" SRR041657_1.fasta: query sequence (input data) swissprot: reference database ================================================================================ Profiling per "living creature" and "gene" import pandas as pd from subprocess import Popen, PIPE def len_fasta(filename): p1 = Popen(['cat', filename], stdout=PIPE, stderr=PIPE) p2 = Popen(['grep', '>'], stdin=p1.stdout, stdout=PIPE, stderr=PIPE) p3 = Popen(['wc', '-l'], stdin=p2.stdout, stdout=PIPE, stderr=PIPE) result, err = p3.communicate() if p3.returncode != 0: raise IOError(err) return int(result.strip().split()[0]) def show_report(fasta_filename): print("Number of reads: {}".format(len_fasta(fasta_filename))) blout_filename = fasta_filename.replace('fasta', 'blout') data = pd.read_table(blout_filename, names='qseqid sseqid pident length mismatch gapopen qstart ' \ 'qend sstart send evalue bitscore sgi staxids'.split(), ) print("Number of annotations: {}".format(len(data))) only_first_taxon = lambda s: s.split(';')[0] data['staxids2'] = data['staxids'].apply(only_first_taxon) subdata = data[['qseqid', 'staxids2', 'sgi']] taxon_counts = subdata.groupby('staxids2').count().sort('qseqid', ascending=False) print("Number of taxons: {}".format(len(taxon_counts))) print("Top 10 taxons") for tax_id, taxon in taxon_counts.iloc[:10].iterrows(): print(" - id {} has {} reads".format(tax_id, taxon['qseqid'])) gene_counts = subdata.groupby('sgi').count().sort('qseqid', ascending=False) print("Number of genes: {}".format(len(gene_counts))) print("Top 10 genes") for gene_id, gene in gene_counts.iloc[:10].iterrows(): print(" - id {} has {} reads".format(gene_id, gene['qseqid'])) if __name__ == '__main__': import argparse parser = argparse.ArgumentParser() parser.add_argument("fasta", help="input FASTA file") args = parser.parse_args() show_report(args.fasta) ================================================================================ Number of reads of FASTA file is large So, loading large FASTA file can be slow in Python ================================================================================ So, this code used "wc -l" subprocess ================================================================================ See len_fasta() It's following command in Python "cat fastafile | grep '>' | wc -l" ================================================================================ Run example python3 metagenome_profiler.py -h usage: metagenome_profiler.py [-h] fasta positional arguments: fasta input FASTA file optional arguments: -h, --help show this help message and exit ================================================================================ - How to run python3 metagenome_profiler.py SRR041657_1.fasta - Output Number of reads: 163642 Number of annotations: 1574 Number of taxons: 361 Top 10 taxons - id 226186 has 304 reads - id 435590 has 174 reads - id 272559 has 118 reads - id 224308 has 71 reads - id 295405 has 57 reads - id 83333 has 55 reads - id 435591 has 28 reads - id 242619 has 26 reads - id 3702 has 23 reads - id 71421 has 21 reads Number of genes: 1190 Top 10 genes - id 81740756 has 32 reads - id 597502304 has 8 reads - id 33301170 has 7 reads - id 81315088 has 7 reads - id 37538299 has 6 reads - id 1709284 has 5 reads - id 54035840 has 5 reads - id 55976660 has 5 reads - id 81444500 has 5 reads - id 61227671 has 5 reads ================================================================================ If you have multiple sample data, you can perform "profiling" on each sample And you can compare samples based on profiled report ================================================================================ In human gut, there are 2 major microbiomes - Bacteroides thetaiotaomicron VPI-5482(226186) - Bacteroides vulgatus ATCC 8482(435590) Gene which is most frequently observed is "SUSC_BACTN (Starch-utilization system protein C)" which is gene of "Bacteroides thetaiotaomicron VPI-5482(226186)" ================================================================================ Question How above pattern can be different in each person?