Navigation

  • index
  • modules |
  • next |
  • previous |
  • PyPhyloGenomics 0.1.0 documentation »

1. Getting started with PyPhyloGenomics: Gene Search¶

Some snippets of code to get you started with writing code using PyPhyloGenomics.

1.1. Finding candidate genes from Bombyx mori¶

We need to obtain candidate genes to be used in phylogenetic inference that have to fulfill the following requirements:

  • Our genes should be orthologs.
  • Our genes should be single-copy genes.
  • Their sequence need to be around 251 DNA base pairs in length.

We will assume that our Next Generation Sequencer available is the IonTorrent.

We have to consider the IonTorrent platform requirements to arrive to our target 250bp gene length:

Primer Length (bp)
Adapter A 30
5’ Index 8
5’ Degenerate Primer 25
Exon ???
3’ Degenerate Primer 25
3’ Index 8
Adapter P 23

For IonTorrent Platform2, the maximum length that can be sequenced is from 280bp to 320bp in total. Thus, 320 - 119 = 201 is the maximum internal gene region (region within degenerate primers).

Therefore, for the new set of primers, being designed for Platform2, we have a maximum amplicon size of 201 + 25*2 = 251bp.

The OrthoDB database has a catalog of orthologous protein-coding genes for vertebrates, arthropods and other living groups.

For this quick getting-started guide, we will download the table of orthologous genes for Arthropoda from OrthoDB’s ftp server OrthoDB6_Arthropoda_tabtex.gz.

Starting off:

>>> import pyphylogenomics
>>> from pyphylogenomics import OrthoDB

To see the full documentation you can use:

>>> help(pyphylogenomics) # shows list of modules
>>> help(pyphylogenomics.OrthoDB) # shows help for a specific module and its functions

This also works:

>>> help(OrthoDB) # shows help for a specific module and its functions

We will find all single-copy genes for the silk moth Bombyx mori using the table from OrthoDB as input file:

>>> in_file = 'OrthoDB6_Arthropoda_tabtext'
>>> genes = OrthoDB.single_copy_genes(in_file, 'Bombyx mori')
...
Found gene: BGIBMGA011628
Found gene: BGIBMGA002142
Found gene: BGIBMGA014116
Found 12167 genes.

The variable genes is a list of IDs for single-copy genes extracted from the OrthoDB table:

>>> genes;
...
'BGIBMGA012533', 'BGIBMGA014144', 'BGIBMGA011628', 'BGIBMGA002142', 'BGIBMGA014116',
'BGIBMGA000220', 'BGIBMGA007580', 'BGIBMGA013398', 'BGIBMGA011517', 'BGIBMGA011820',
'BGIBMGA014318', 'BGIBMGA013470', 'BGIBMGA011304', 'BGIBMGA005643', 'BGIBMGA002698']
>>> len(genes);
12167

We will use these gene IDs to obtain the gene sequences from the Bombyx mori genome. The genome can be downloaded from silkdb.org. We will download the Consensus gene set by merging all the gene sets using GLEAN: Fasta file silkworm_glean_cds.fa.tar.gz.

Untar the gzipped file and you will get the FASTA formatted file silkcds.fa containing gene IDs and sequences for Bombyx mori.

We will need to BLASTn these single-copy genes against the Bombyx mori genome in order to get exon sizes:

  1. Pull all sequences for our gene IDs from the CDS file and write them to a file pulled_seqs.fasta:

    >>> from pyphylogenomics import BLAST
    >>> cds_file = "silkcds.fa"
    >>> BLAST.get_cds(genes, cds_file)
    12167  sequences were written to file pulled_seqs.fasta
    
  2. Download the Bombyx mori genome from silkdb.org (download the file silkworm_genome_v2.0.fa.tar.gz). Unzip and untar the file to your working directory and you will get the file silkgenome.fa

  3. Do a BLASTn of the sequences against the Bombyx mori genome. The input arguments are your file containing the sequences for single-copy genes (pulled_seqs.fasta) and your file with the genome of Bombyx mori which is in FASTA format (silkgenome.fa).

    >>> BLAST.blastn('pulled_seqs.fasta', 'silkgenome.fa')
    ...
    BLASTn finished!
    The BLAST results were written in to the file  pulled_seqs_blastn_out.csv
    

The file pulled_seqs_blastn_out.csv contains a BLAST output table with the blast results. PyPhyloGenomics has functions to filter out the table and get information to answer the following:

  • Which are the longest gene-sequence to genome matches?
  • Which are the genes that are “distantly enough” from each other? So that, they can used as independent evolutionary entities?
  1. As stated before, we prefer long exons for each of the candidate genes ( > 300 nucleotides):

    >>> exons = BLAST.getLargestExon("pulled_seqs_blastn_out.csv", E_value=0.001, ident=98, exon_len=300)
    Parsing BLAST table ...
    Deleting exons below 300 nucleotides ...
    There are 7411 exons
    
  2. Some small segments of sequences might match exons present in more than one scaffold. We will use the function eraseFalsePosi to keep those matches of longest length:

    >>> exons = BLAST.eraseFalsePosi(exons) # Drop presumable false positives.
    Erasing False Positives ...
    There are 6346 exons
    
  3. Ideally we want exons that are not too close to each other in the genome to avoid gene linkage. So we will keep only those exons that are apart by 810 kilobases:

    >>> exons = BLAST.wellSeparatedExons(exons) # Keep exons separated by > 810KB
    Identifying exons separated by 810000 bases ...
    There are 574 exons
    
  4. Finally we can use a function to save the obtained exons while making sure they are in frame. We need to use as additional arguments the genome file and output filename:

    >>> BLAST.storeExonsInFrame(exons, "pulled_seqs.fasta", "Bombyx_exons.fas")
    Storing exons ...
    A total of 574 exons are kept
    These exons have been stored in the file: Bombyx_exons.fas
    

1.2. Validation of exon structure¶

We have now 574 single copy exons extracted from the Bombyx mori genome. Let’s find out whether these exons are conserved in other Arthropoda species.

For example we can compare these 574 exons with the genome of the monarch butterfly Danaus plexippus.

1.2.1. Danaus plexippus¶

  1. Download the version two of the monarch butterfly genome from here: http://danaus.genomeprojectsolutions-databases.com/Genome_seq_stats.html

  2. Extract the genome as FASTA file using gunzip:

    • gunzip Dp_genome_v2.fasta.gz
  3. Do a blastn of our Long Exons against the Danaus genome:

    >>> BLAST.blastn("Bombyx_exons.fas", "Dp_genome_v2.fasta");
    ...
    BLASTn finished!
    The BLAST results were written in to the file Bombyx_exons_blastn_out.csv
    
  4. We need to parse the output blast table and extract the exons from Danaus that are longer than 300bp and are homologous to the exons of Bombyx mori.

    >>> BLAST.blastParser("Bombyx_exons_blastn_out.csv", "Dp_genome_v2.fasta", "Danaus_exons.fasta", sp_name="Danaus")
    Reading files ...
    Parsing BLAST table ...
    A total of 158 sequences passed the thresholds.
    They have been stored in the file: Danaus_exons.fasta
    

The parameter sp_name is important as it will be used as part of the exons IDs.

1.2.2. Heliconius melpomene¶

  1. We can continue finding homologous exons in other related butterflies. For example Heliconius melpomene.

  2. Download the genome from here: http://metazoa.ensembl.org/Heliconius_melpomene/Info/Index

  3. Extract the genome as FASTA file:

    • gunzip Heliconius_melpomene.Hmel1.17.dna_rm.toplevel.fa.gz
    • mv Heliconius_melpomene.Hmel1.17.dna_rm.toplevel.fa Heliconius_genome.fa
  4. BLASTn the Bombyx mori exons against the Heliconius genome:

    >>> BLAST.blastn("Bombyx_exons.fas", "Heliconius_genome.fa");
    ...
    BLASTn finished!
    The BLAST results were written in to the file  Bombyx_exons_blastn_out.csv
    
  5. Parse the blast table, extract the exon sequences and save them to a file:

    >>> BLAST.blastParser("Bombyx_exons_blastn_out.csv", "Heliconius_genome.fa", "Heliconius_exons.fasta", sp_name="Heliconius")
    Reading files ...
    Parsing BLAST table ...
    A total of 145 sequences passed the thresholds.
    They have been stored in the file: Heliconius_exons.fasta
    

1.2.3. Manduca sexta¶

  1. Repeating the procedure for the tobacco hornworm.

  2. Download the genome from ftp://ftp.bioinformatics.ksu.edu/pub/Manduca/

  3. We downloaded the file Msex05162011.genome.fa.

  4. Blasted the Bombyx mori exons against the Manduca genome:

    >>> BLAST.blastn("Bombyx_exons.fas", "Msex05162011.genome.fa")
    ...
    BLASTn finished!
    The BLAST results were written in to the file  Bombyx_exons_blastn_out.csv
    
  5. Parsing the output blast table:

    >>> BLAST.blastParser("Bombyx_exons_blastn_out.csv", "Msex05162011.genome.fa", "Manduca_exons.fasta", sp_name="Manduca")
    Reading files ...
    Parsing BLAST table ...
    A total of 219 sequences passed the thresholds.
    They have been stored in the file: Manduca_exons.fasta
    

1.3. Small break¶

A quick summary of the work so far:

  1. We obtained a list of orthologous and single copy genes by parsing the dataset for Arthopod genes from OrthoDB.
  2. From those genes, we took the exon sequences for Bombyx mori from its Coding DNA Sequences (CDS) from silkdb.org.
  3. We want to be sure that there are no introns inside our candidate exons. So we blasted the CDS sequences against the Bombyx mori genome.
  4. We filtered those exons that were longer than 300 bp, were separated by 810 kilobases and got them inframe.
  5. We did massive blasting of these selection of exons against genomes of related species: Danaus plexippus, Heliconius melpomene and Manduca sexta.
  6. We got one FASTA file with the homologous regions for each species genome.
  7. Now, we will proceed to align all those homologous exons in order to design primers.
  8. Thus, we will be albe to sequence these exons accross a wide range of species in the order Lepidoptera.

1.4. Exon Alignment¶

We will use our module MUSCLE to do the alignment. We need to use as input a python list of the filenames that contain the exons of each species. All aligned sequences will be written into a folder called alignments as FASTA files (one file per exon).

Warning

In the list of files, we will put FIRST the file for Bombyx mori, so that it will be used as “master” file. This is because the script will look for sequences in other files that appear in the file for Bombyx mori.

Example:

>>> from pyphylogenomics import MUSCLE
>>> files = ['Bombyx_exons.fas', 'Danaus_exons.fasta','Heliconius_exons.fasta','Manduca_exons.fasta']
>>> MUSCLE.batchAlignment(files)
...
Pooling gene BGIBMGA000851:1-597
Pooling gene BGIBMGA010204:1-516
132 alignments have been saved in the folder "alignments"

Warning

It is always recommended to check by eye every alignment that has been generated by any software. Once you are sure that the alignment is correct, we can continue with the analysis.

1.5. Primer design¶

Now that we have our exons/genes from several species (Bombyx, Manduca, Danaus and Heliconius in this example), we can design primers in order to sequence these genes across a wide range of butterflies and/or moths.

Since we have 132 candidate genes to design primers for, we can automate the primer design using a nice tool available in PyPhyloGenomics.

The function designPrimers will send an alignment to primers4clades along (with some parameters) and do a request for primer design. This function will return the degenerate primers as estimated by primers4clades.

It is recommended that you enter your email as one of the parameters so that primers4clades can send you an email with very detailed results for you to insect. Just in case you don’t provide your email, the very detailed results will be saved in the same folder of your alignments as HTML files.

Warning

Please keep in mind that the designPrimers function will return very little data, i.e. only the forward and reverse primers for an alignment. But it might be necessary that you inspect the detailed information saved as HTML files or the emails sent to you by primers4clades. You will receive one email for each alignment.

Before primer design it could be useful to trim off the ends of the sequences so that all sequences will have the same length:

>>> from pyphylogenomics import MUSCLE
>>> MUSCLE.bluntSplicer("alignments/") # folder_path containing the FASTA file alignments

This will produce FASTA files ending in “_bluntlySpliced.fasta”. You may want to remove the old unspliced FASTA files before doing primer design.

Automated primer design via primers4clades:

  • Alignment in FASTA format containing at least 4 sequences.

  • Several parameters:

    • temperature
    • minimium amplicon length
    • maximum amplicon length
    • genetic code
    • cluster type
    • substitution model
    • email address

Example: The values shown are the default. Change them if needed.

>>> from pyphylogenomics import MUSCLE
>>> folder = "alignments"   # folder containing the FASTA file alignments
>>> tm = "55"               # annealing temperature
>>> min_amplength = "250"   # minimium amplicon length
>>> max_amplength = "500"   # maximum amplicon length
>>> gencode = "universal"   # see below for all available genetic codes
>>> mode  = "primers"
>>> clustype = "dna"
>>> amptype = "dna_GTRG"    # substitution model used to estimate phylogenetic information
>>> email = "youremail@email.com"   # primer4clades will send you an email with very detailed results
>>> MUSCLE.designPrimers(folder, tm, min_amplength, max_amplength, gencode, mode, clustype, amptype, email)
...
Done.
All primers have been saved in the file "primers.fasta"

All primers will be saved to a file (primers.fasta). However, it is recommended that you study the very detailed output saved into your alignments folder as HTML files so that you can decide to use these primers or not.

Table Of Contents

  • 1. Getting started with PyPhyloGenomics: Gene Search
    • 1.1. Finding candidate genes from Bombyx mori
    • 1.2. Validation of exon structure
      • 1.2.1. Danaus plexippus
      • 1.2.2. Heliconius melpomene
      • 1.2.3. Manduca sexta
    • 1.3. Small break
    • 1.4. Exon Alignment
    • 1.5. Primer design

Previous topic

Welcome to PyPhyloGenomics’s documentation!

Next topic

2. Analysis of raw data output from IonTorrent

This Page

  • Show Source

Quick search

Enter search terms or a module, class or function name.

Navigation

  • index
  • modules |
  • next |
  • previous |
  • PyPhyloGenomics 0.1.0 documentation »
© Copyright 2013, Carlos Peña, Victor Solis, Pavel Matos, Chris Wheat. Created using Sphinx 1.1.3.