Welcome to PyPhyloGenomics’s documentation!


PyPhyloGenomics is a Python toolkit for developing phylogenomic markers in novel species for Next Generation sequence data.


It includes tools for:

  • discovery of homologous genes from species genomes.
  • primer design.
  • analysis of Next Generation Sequencing data, and
  • gene sequences assembly.


Download the source code from github.



This module uses the table OrthoDB6_Arthropoda_tabtext.csv downloaded from OrthoDB (the database of orthologous groups) ftp://cegg.unige.ch/OrthoDB6 to extract certain features and objects described below.


* Internal function *

Creates a dictionary with the number of copies per gene and per species.

The dictionary has as keys tuples=(species_name, gene_name) and values=integer that represent the number of copies of that gene in the species.

OrthoDB.copies_per_gene_table(in_file, out_file)

Stores the number of copies a gene has per species in a file.

The output file is a table, where each row is a gene, and each columm is a species. The values in the table are the number of gene copies. Note: the first row in the otput table is the number of single-copy genes in a given species.

OrthoDB.single_copy_genes(in_file, species_name)

Returns a list of single-copy genes for a species given by user.

The species name should be in the same format as stated in the input file in_file.

OrthoDB.single_copy_in_species(in_file, gene_name)

Returns the species where the gene_name given by user is a single-copy gene.

The gene name should be in the same format as stated in the input file in_file.


Prepares data and executes BLAST commands to create BLAST database. Performs blastn of user sequences against genomic sequences. Provides functions to parse the generated BLAST tables and extract exons.

BLAST.batch_iterator(iterator, batch_size)

Returns lists of length batch_size.

# Taken from http://biopython.org/wiki/Split_large_file

This can be used on any iterator, for example to batch up SeqRecord objects from Bio.SeqIO.parse(...), or to batch Alignment objects from Bio.AlignIO.parse(...), or simply lines from a file handle.

This is a generator function, and it returns lists of the entries from the supplied iterator. Each list will have batch_size entries, although the final list may be shorter.

BLAST.blastParser(blast_table, sbj_db, out_file, sp_name='homologous', E_value=0.01, ident=75, exon_len=300)

Returns the subjects’ sequences aligned with the queries as long as they pass the thresholds given on the parameters.


>>> BLAST.blastParser("LongExons_out_blastn_out.csv", "Dp_genome_v2.fasta", "Danaus_exons.fasta", sp_name="Danaus");
Reading files ...
Parsing BLAST table ...
A total of 158 sequences passed the thresholds.
They have been stored in the file: Danaus_exons.fasta

The parameter sp_name is important as it will be used as part of the exons IDs.

BLAST.blastn(query_seqs, genome, e_value=1e-05, mask=True)

Performs a BLASTn of user’s sequences against a genome. It will create a BLAST database from the genome file first.

query_seqs argument is a FASTA file of users sequences. genome argument is a FASTA file of a species genome.


Keeps the query-sbj match with the longest alignment number.

From a dictionary generated with the getLargestExon function, where 2 or more query-sbj matches shared the same query, eraseFalsePosi keeps the query-sbj match with the longest alignment.

BLAST.filterByMinDist(genes_loci, MinDist)

Returns those genes that are separated from its precedent neighbour by < MinDist threshold.

The parameter genes_loci is a list of tuples, where each tuple consists of 3 items: a gene_id, start_position and end_position.

BLAST.getLargestExon(blast_table, E_value=0.001, ident=98, exon_len=300)

Returns the highest alignment number for each query-sbj match from a blast table.

BLAST.get_cds(genes, cds_file)

Writes a FASTA file containing CDS sequences: pulled_seqs.fasta

genes argument is a list of gene IDs. cds_file is the file name of predicted CDS for a species in FASTA format.

BLAST.makeblastdb(genome, mask=False)

Creates a BLAST database from a genome in FASTA format and optionally eliminates low-complexity regions from the sequences.

BLAST.storeExonsInFrame(exons_dict, queries_db, out_file)

Strips the exon’s ends, so that it is in frame, and then stores the results in a file.

Strip the exon’s first and last residues so that they correspond to the start and end of a codon, respectively. Then stores the stripped exons in a file.

BLAST.wellSeparatedExons(exons_dict, MinDist=810000)

Keeps the exons whose distance to the following exon is > MinDist.

From a dictionary generated with the getLargestExon function, where 2 or more query-sbj matches shared the same sbj, wellSeparatedExons keeps the query-sbj match whose distance to the following query-sbj match is greater than MinDist.


Reads in sequences from files, group homologous sequences based on gene IDs and aligns them.


Reads in sequences from files, group homologous sequences and aligns them.

files a list of file names. The first file in the list should be the master file (e.g. exon sequences of Bombyx mori).


>>> from pyphylogenomics import MUSCLE
>>> files = ['Bmori_exons.fasta', 'Danaus_exons.fasta','Heliconius_exons.fasta','Manduca_exons.fasta']
>>> MUSCLE.batchAlignment(files)

All aligned sequences will be written into a folder called alignments as FASTA files (one file per exon).

MUSCLE.bluntSplicer(folder_path, window=20)

Splices Multiple Sequence Alignments objects found in folder_path. Objects should be in FASTA format. Gaps from both flanks of the alignments will be trimmed.

Window size is used to find gaps in flanks (default 20 nucleotides). bluntSplicer reads the MSA files from folder_path, and stores the spliced MSAs in the same folder.


>>> from pyphylogenomics import MUSCLE
>>> MUSCLE.bluntSplicer("alignments/") # folder_path containing the FASTA file alignments
MUSCLE.designPrimers(folder, tm='55', min_amplength='100', max_amplength='500', gencode='universal', mode='primers', clustype='dna', amptype='dna_GTRG', email='')

It will send a FASTA alignment to primers4clades in order to design degenerate primers. Input data needed:

  • Alignment in FASTA format containing at least 4 sequences.

  • Several parameters:

    • temperature
    • minimium amplicon length
    • maximum amplicon length
    • genetic code
    • cluster type
    • substitution model
    • email address

Example: The values shown are the default. Change them if needed.

>>> from pyphylogenomics import MUSCLE
>>> folder = "alignments"   # folder containing the FASTA file alignments
>>> tm = "55"               # annealing temperature
>>> min_amplength = "250"   # minimium amplicon length
>>> max_amplength = "500"   # maximum amplicon length
>>> gencode = "universal"   # see below for all available genetic codes
>>> mode  = "primers"
>>> clustype = "dna"
>>> amptype = "dna_GTRG"    # substitution model used to estimate phylogenetic information
>>> email = "youremail@email.com"   # primer4clades will send you an email with very detailed results
>>> MUSCLE.designPrimers(folder, tm, min_amplength, max_amplength, gencode, mode, clustype, amptype, email)

The best primer pairs will be printed to your screen. Detailed results will be saved as HTML files in your alignments folder. But it is recommended if you also get the results by email. primers4clades will send you one email for each alignment.

The genetic code table (variable gencode) can be any of the following:

  • universal for standard
  • 2 for vertebrate mitochondrial
  • 3 for yeast mitochondrial
  • 4 for mold and protozoa mitochondrial
  • 5 for invertebrate mitochondrial
  • 6 for ciliate
  • 9 for echinoderm and flatworm
  • 10 for euplotid nuclear
  • 11 for bacterial and plastid
  • 12 for alternative yeast nuclear
  • 13 for ascidian mitochondrial
  • 14 for flatworm mitochondrial
  • 15 for Blepharisma nuclear
  • 16 for Chlorophycean mitochondrial
  • 21 for Trematode mitochondrial
  • 22 for Scenedesmus obliquus mitochondrial
  • 23 for Thraustochytrium mitochondrial

The evolutionary substitution model can be any of the following (variable amptype):

  • protein_WAGG for protein WAG+G
  • protein_JTTG for protein JTT+G
  • protein_Blosum62G for protein Blosum62+G
  • protein_VTG for protein VT+G
  • protein_DayhoffG for protein Dayhoff+G
  • protein_MtREVG for protein MtREV+G
  • dna_HKYG for dna HKY+G
  • dna_GTRG for dna GTR+G
  • dna_K80G for dna K80+G
  • dna_TrNG for dna TrN+G
  • dna_JC69G for dna JC69+G


Prepares output data from sequencing round in IonTorrent (FASTQ format file).

  • Changes quality format from Phred to Solexa (which is required by the fastx-toolkit).
  • Changes sequences id to incremental numbers.
  • Creates temporal FASTA file.

This module also finds matches between reads and expected gene regions by using BLASTn, and separates reads based on matches against indexes.

It also has functions to do assembly of reads using velvet.

NGS.assembly(fastq_file, index_length, min_quality=20, percentage=70, min_length=50)

Do de novo assembly of expected sequences after doing quality control of a FASTQ file. Quality control includes dropping reads with low quality values, trimming of bad quality end and trimming index region.

  • fastq_file FASTQ format file that ideally has been separated by gene and index using the functions NGS.parse_blast_results() and NGS.separate_by_index()
  • index_length number of base pairs of the indexes. They will be trimmed from the reads during processing.
  • min_quality minimum quality score to keep (20 by default)
  • percentage minimum percent of base pairs that need to have min_quality (70% by default)
  • min_length minimum length of read sequences to keep (50 by default)


>>> from pyphylogenomics import NGS;
>>> fastq_file   = "index_Ion_4_gene_rps5.fastq";
>>> index_length = 8;
>>> min_quality  = 30; # optional
>>> percentage   = 80; # optional
>>> min_length   = 60; # optional
>>> NGS.assembly(fastq_file, index_length, min_quality, percentage, min_length);
NGS.count_reads(fastqFile, file_format)

* Internal function *

NGS.find_index_in_seq(barcode, seq, levenshtein_distance)

* Internal function *

Arguments: barcode, read sequence. Iterate through primer’s degenerated IUPAC and try to find it in sequence read. Return TRUE on success.


* Internal function *


* Internal function *

NGS.levenshtein(a, b)

* Internal function *

Calculates the Levenshtein distance between a and b.

NGS.parse_blast_results(blast_table, ion_file)

This function uses the BLAST results from a CSV file and separates the IonTorrent reads into bins that match each of the target genes (or reference sequences). To speed up things a few tricks are made: * Split CSV file into several chunks. * Split ionfile.fastq into complementary chunks so that the reads in the CSV chunks will be found in our fastq chunks. Reads will be written into separate FASTQ files, one per matching target gene. These files will be written in the folder output.

This step will accept matching reads that align more than 40bp to the expected gene sequence. Function NGS.filter_reads()

NGS.prepare_data(ionfile, index_length)
  • Changes quality format from Phred to Solexa (which is required by the fastx-toolkit).
  • Changes sequences id to incremental numbers.
  • Creates temporal FASTA file with the indexes removed from the sequences.

Files generated will be written to folder data/modified/

  • ionfile argument is FASTQ format file as produced by IonTorrent
  • index_length number of base pairs of your indexes. This is necessary to trim the indexes before blasting the FASTA file against the reference gene sequences.


>>> from pyphylogenomics import NGS
>>> ionfile = "ionrun.fastq";
>>> index_length = 8;
>>> NGS.prepare_data(ionfile, index_length);
Your file has been saved using Solexa quality format as data/modified/wrk_ionfile.fastq
Your sequence IDs have been changed to numbers.
The FASTA format file data/modified/wrk_ionfile.fasta has been created.
NGS.quality_control(fastq_file, index_length, min_quality=20, percentage=70, min_length=50)

* Internal function *

Use fastx-tools to do quality control on FASTQ file.

NGS.split_ionfile_by_results(ion_file, blast_chunk)

* Internal function *

This function divides an IonTorrent run file into chunks that match chunks of a BLAST results file in CSV format.


* Internal function *

Split a CSV format result file from a BLAST in chunks.

Indices and tables