Skip to content

Latest commit

 

History

History
494 lines (398 loc) · 21.9 KB

Mitogenome_analysis.md

File metadata and controls

494 lines (398 loc) · 21.9 KB

Download mitogenomes

IMPORTANT: the names on your genes in the fasta file should be 8 characters (without considering ">")

For example, any sequence downloaded from GenBank look like this (without the " ' ").

'>JX311919.1 Cottus sp. MKY-2013 isolate Bird Creek cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial

You should reduce it to 8 characters (without the " ' "), making sure each one has a different name.

'>Cottus_1

Note: You could use a longer name for your genes according to your project, but make sure to update the code #------------ format into a single line fasta (see below) to the number of characters you are using.

Alignment

Mitochondrial genomes have non-coding (e.g. rRNA or tRNA) and coding sequences (e.g. COI, Cytb, NAD1, etc). You should use TranslatorX and MAFFT to perform alignments, respectively. To make it clearer:

  • MAFFT for non-coding sequences
  • TranslatorX for coding sequences

Mafft

Loop to run mafft in computer (download here an place in the same folder the sequences to align), save it as .sh file

for i in *.fasta
do
    mafft --quiet $i > ${i%.fasta}.aligned.fasta
done

Alignment with TranslatorX

Particularly for coding sequences. There is an online version of TranslatorX http://translatorx.co.uk/ where you could run your .fasta files. Also, there is a local version in pearl to run a lot of files at once. Copy and save it in the folder where your fasta files are. Meaning of each parameter is explained in the script (line 80).

#To run one file
perl translatorx_vLocal.pl -i name_file.fasta -o output_name -p F -c 5 -t T

Loop version: translatorx-loop.sh

dos2unix translatorx-loop.sh
sh translatorx-loop.sh

Then, extract files of your interest, in this case the nucleotide fasta alignment.

mkdir xalign
mv *.nt_ali.fasta xalign/
cd xalign/

Then you should cut the flanks and continue with Gblocks step.

But, wait a second! How to cut the flanks?

Aligned .fasta files preparation for Gblocks

Sometimes after exporting sequences, they are not in one row

Flank cutting

You could cut the flanks with Clustal or MEGA softwares and save it in .fasta format.

Important: Some researches consider the flanks (both extremes of the alignment before a sequence block appears) are informative. However, if most of the sequences doesn't have the same lenghts, when the alignment is processed in further analysis it would compare an extreme of the sequence to "nothing". So, to avoid that, it's preferred to shorten the alignment and focus the analysis where 50% to 75% of the sequences start sharing information, non-considering the gaps within the alignment.

Make sure the sequences are in one row

Sometimes exported .fasta files look like this (without the extra space between the lines and the " " ")

">sequence_1
AYOUR.DNA.OR.PROTEIN..SEQUENCEAAA

AAASDAAAAAAAAAAAAAAAA...........AAAA

TGTGTATGTATGCCCCCCCCC.......THEAAEND

">sequence_2
AYOUR.DNA.OR.PROTEIN..SEQUENCEAAA

AAASDAAAAAAAAAAAAAAAA...........AAAA

TGTGTATGTATGCCCCCCCCC.......THEAAEND

But, some programs only accept .fasta files with sequences in one row that look like this.

">sequence_1 AAYOURAAADNAAAORPROTEINAASEQUENCEAAAAAASDAAAAAAAAAAAAAAAA...........AAAATGTGTATGTATGCCCCCCCCC.......THEAAEND ">sequence_2 AAYOURAAADNAAAORPROTEINAASEQUENCEAAAAAASDAAAAAAAAAAAAAAAA...........AAAATGTGTATGTATGCCCCCCCCC.......THEAAEND

To do that, you could run this bash code

#------------ format into a single line fasta 
#delete enter in all the file
tr -d "\n" < your_sequence.fasta > dummy
#includes an "enter" per specie
sed -i -E "s/>/\n>/g" dummy
#delete first empty row
awk 'NR>1' dummy > dummy2
#add enter to ID, only considers the first eight characters, but you could consider a longer name changing the "8" (e.g. 10) 
sed -e "s/>.\{8\}/&\n/g" < dummy2 > output-cc.fasta
rm dummy dummy2

Gblocks

Installation

Download Gblocks for MAC or Linux https://slackbuilds.org/repository/15.0/academic/Gblocks/

Downoad Gblocks for Windows https://github.com/dongzhang0725/PhyloSuite_plugins

Documentation https://home.cc.umanitoba.ca/~psgendb/doc/Castresana/Gblocks_documentation.html

If you work in a WSL, you could move the compressed file to your programmes folder (or home directory). Then run this

uncompress Gblocks_OS_0.91b.tar.Z
tar xvf Gblocks_OS_0.91b.tar

Run Gblocks

Input files are

  1. Gblock executable program
  2. Aligned (one row per each sequence) fasta files

Both should be in the same folder, to do that you could use the following lines

#Create a folder to include the aligned files and Gblock executable program 
mkdir gblock-tests
#Enter Gblocks folder
cd Gblocks_0.91b
#Copy executable file to the folder
cp Gblocks /yourdesirepath/gblock-tests/
#Go to the gblock-tests folder
cd /yourpath/gblock-tests

RUN GBLOCKS

#Don't forget to copy the .fasta files to your folder before executing (you could include more parameters, see Gblocks Documentation)
Gblocks <filename.fasta> -t=d -b5=n -p=y 

For more than one file, you could use this simple loop (download gblocks.sh)

#sometimes there is a problem with .sh format, you could use 'dos2unix' function to solve it... install with sudo apt-get install dos2unix
dos2unix gblocks.sh
#run the file
sh gblocks.sh

The GBLOCKs output will be two extra files per .fasta files, (1) .fasta-gb and (2) .html. Then you could organize it in folders (Optional).

mkdir fasta-gb html
mv *.fasta-gb fasta-gb/
mv *.htm html/

The .htm file gives you information about the alignment size, the blocks and other details. This file will be useful to construct the .cfg file for PF2 (see .cfg file section).

On the other hand, the .fasta-gb files need to be converted to .fasta file to concatenate in the next step. You could use the fastagb2fasta.sh make sure to have it in the same folder as the .fasta-gb files).

cd fasta-gb #go to the folder with the fasta-gb files and the .sh file
dos2unix fastagb2fasta.sh
sh fastagb2fasta.sh 

**Comment: ** Gblocks results differs in Linux or Terminal.

Format output file for PF2

Concatenate .fasta files

Input:

  1. .fasta files per each gene

Running in Terminal (Ubuntu 20.04.5 LTS for WSL)

#Install seqkit (if you don't have it)
conda install -c bioconda seqkit
#concatenate in one file
seqkit concat *.fasta > output.fasta
#------------ format into a single line fasta 
#delete enter in all the file
tr -d "\n" < output.fasta > dummy
#includes an "enter" per specie
sed -i -E "s/>/\n>/g" dummy
#delete first empty row
awk 'NR>1' dummy > dummy2
#add enter to ID, only considers the first eight characters, but you could consider a longer name changing the 8 
sed -e "s/>.\{8\}/&\n/g" < dummy2 > output-cc.fasta
rm dummy dummy2

Other options (not recommended)

Convert .fasta to .phy for Partition Finder 2

The alignment need to be in phylip format (.phy) to run Partition Finder 2.

The only program that give the right format is Geneious Prime (the free version). 0. Install Geneious Prime (Free version)

  1. File > Import File(s)
  2. Click on "Keep Alignment"
  3. Select you alignment
  4. File > Export > Documents and select "Phylip alignment (.phy)" in the "Files of Type" box

Partition Finder 2 (PF2)

Rob Lanfear Manual: https://www.robertlanfear.com/partitionfinder/assets/Manual_v2.1.x.pdf Rob Lanfear Tutorial: http://www.robertlanfear.com/partitionfinder/tutorial/

Environment setting in Terminal (Ubuntu 20.04.5 LTS for WSL)

Create environment for PF2 installation

conda create --name py2 python=2 
#Activate (or enter) to the environment
conda activate py2 
#you will notice that 'base' changed to 'py2'
#install dependencies
conda install numpy pandas pytables pyparsing scipy scikit-learn

Installation of PF2

wget https://github.com/brettc/partitionfinder/archive/v2.1.1.tar.gz
#Decompress file
tar xfz v2.1.1.tar.gz

Exactly, you only need to decompress it.

Run Partition Finder 2

Input files:

  1. Phylip alignment file (see Gblocks - Output)
  2. the .cfg file (to write one see .cfg file section
  3. PartitionFinder.py (included in the zippe file - see Installation of PF2)

To run PF2,

  1. the .phy and .cfg files need to be in the same and separate folder from all files you are working on (wherever you like, as long you know the path).
  2. You should be on the partitionfinder-2.1.1 folder, where script would call the other programs.
python PartitionFinder.py /usr/yourpath/output.phy

The parameters of the analysis need to be defined in the .cfg file (what includes? see this section) that the program would call during the analysis.

Now, the interesting part comes... how to interpret the data according to the models (yes, there are more than one parameter you should set for your analysis)? Take a look to Rob Lanfear tutorial.

Possible errors (and solutions) when you attempt to run it

No. Error Reason Solution
1 ImportError: No module named 'log tools' Wrong Python version Create a new python environment with only 2.7.x version
2 Could not find the dependency numpy Wrong Python version or forgot to install dependency Use Sol. 1 and install dependencies
3 Command 'python' not found, did you mean: command 'python3' from deb python3 command 'python' from deb python-is-python3 Wrong python version Use Sol. 1

How to know which python version are you using? python --version

If appears 3.10.x or 3.x, that's probably the default python version in your system. Any environment you create will set up to the systems' python version, unless you specify that you want python2 or other version. To work with PartitionFinder should appear as 2.7.18 or 2.7.x version (see Environment Setting)

**Possible errors (and solutions) after you manage to run it **

#Problem 1: Does not find the .cfg file or calling the program from a different folder

INFO     | 2023-03-20 19:41:53,459 | Note: NumExpr detected 20 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO     | 2023-03-20 19:41:53,459 | NumExpr defaulting to 8 threads.
INFO     | 2023-03-20 19:41:53,552 | ------------- PartitionFinder 2.1.1 -----------------
INFO     | 2023-03-20 19:41:53,552 | You have Python version 2.7
INFO     | 2023-03-20 19:41:53,552 | Command-line arguments used: PartitionFinder.py output.phy
INFO     | 2023-03-20 19:41:53,552 | ------------- Configuring Parameters -------------
INFO     | 2023-03-20 19:41:53,552 | Setting datatype to 'DNA'
INFO     | 2023-03-20 19:41:53,552 | Setting phylogeny program to 'phyml'
INFO     | 2023-03-20 19:41:53,553 | Program path is here /home/w/partitionfinder-2.1.1/programs
ERROR    | 2023-03-20 19:41:53,553 | No such folder: ''
ERROR    | 2023-03-20 19:41:53,553 | Failed to run. See previous errors.

#Solution 1: create the .cfg file (see .cfg file section) and save it in a separate folder with the .phy file. Make sure you are running from partitionfinder-2.1.1 folder, not from your file folder.

#Problem 2: Does not find the .phy file

INFO     | 2023-03-21 09:06:04,778 | ------------------------ BEGINNING NEW RUN -------------------------------
INFO     | 2023-03-21 09:06:04,778 | Looking for alignment file './infile.phy'...
ERROR    | 2023-03-21 09:06:04,778 | Failed to find file: './infile.phy'. Please check and try again.
ERROR    | 2023-03-21 09:06:04,778 | Failed to run. See previous errors.

#Solution 2: correct the .phy filename in your .cfg file, as you see I forgot to change infile.phy for output.phy

How to write a .cfg file?

The author has a comprehensive Rob Lanfear tutorial how to write one and the meaning of each parameter. Just a quick revision in here.

A .cfg looks like this

## ALIGNMENT FILE ##
alignment = youralignment.phy;

## BRANCHLENGTHS: linked | unlinked ##
branchlengths = linked;

## MODELS OF EVOLUTION: all | allx | mrbayes | beast | gamma | gammai | <list> ##
models = all;

# MODEL SELECCTION: AIC | AICc | BIC #
model_selection = aicc;

## DATA BLOCKS: see manual for how to define ##
[data_blocks]
gene1=1-444;
gen2=445-980;
geb3=981-1540;

## SCHEMES, search: all | user | greedy | rcluster | rclusterf | kmeans ##
[schemes]
search = greedy;

The data block section may seem a problem if you work with more than 10 genes, as you will have to take a look to the size of each alignment from gblocks. Also, make sure items appear in the same order as the sequences were concatenated (see Concatenate .fasta files)

An alternative is to use the .htm reports from gblock.

grep -oE '(New number of positions: <b>)[^ ]*' *.htm > alg-size.csv
sed -i "s/.aligned.cut.fasta-gb.htm:New number of positions: <b>/,/g" alg-size.csv
tr -d "</b>" < alg-size.csv > alg-size-final.csv

Then you could open it with excel, and give the desired format with "sum" and "concatenate" functions.

(Version to run in the terminal - coming soon!!!)

If you explore the examples (included in the partitionfinder folder), you may see that the DATA BLOCK section could be written as well like this,

Gene1_pos1 = 1-789\3;
Gene1_pos2 = 2-789\3;
Gene1_pos3 = 3-789\3;

Contrastingly, in the example .cfg (above), I don't have the exact codon positions, because I pass it through GBLOCKs and selected the most conserved regions. Do you know any other reason why? Please leave it in the comments.

Output: Partitions for your alignment Depending of the analysis you will get

  • The best substitution model
  • Subsets in Nexus, RAXMl and Bayes formats.
  • Chartpartitions.

Close PF2

conda deactivate

Now you could return use any environment with python3 or superior.

Phylogenetic analysis

Maximum Likelihood - RAXMl

There are two ways to run RAXMl analysis.

  1. CIPRES website (you would need to create an account)
  2. In the Ubuntu Terminal

I'm going to comment about the second option. The Exelis Lab is the group that developed this software and also provide interesting examples. Exelis Lab RAXMl: https://cme.h-its.org/exelixis/web/software/raxml/

Download the infomation from the GitHub repository (https://github.com/stamatak/standard-RAxML). Go to Code > Download ZIP file and follow the instruction follow the instructions from the RAxML hands-on session in the step. (Yes! you will need to run all from the Terminal). If you consider neccesary, you could create a RAXMl environemnt, not to disturb python2 of "py2" environment or any other.

Run RAXML

You could follow this tutorial. You will need (in the same folder)

  1. Alignment file in .phylip format
  2. Partition squeme from PF2 in simple .txt file
  3. The executble file raxmlHPC (see tutorial to generate)

Example 1:

#One bootstrap
./raxmlHPC -m GTRGAMMA -p 12345 -q partition.txt -s alignment_file.phy -n T21

Example 2:

#Fast bootstraping
./raxmlHPC -f a -m GTRGAMMA -p 12345 -q partition.txt -x 12345 -# 100 -s alignment_file.phy -n T1
What is T1, T2, T3 ... Tn?

Is a suffix not to overwrite the RAXMl results if run different experiments.

How to include the partitions from the PF2?
  1. Save the partitions obtained for RAXMl from the previous PF2 analysis in a simple text file (e.g. partitions.txt). You will include (1) the .phy file and (2) the .txt partition file in the command line. according to "Step 6: Partitioned Analysis" and continue with the analysis.

Possible errors (and solutions)

Problem #1.a

Command 'make' not found, but can be installed with:

sudo apt install make        # version 4.2.1-1.2, or
sudo apt install make-guile  # version 4.2.1-1.2

Solution #1.a

sudo apt install make

Problem #1.b: 'make' installation didn't work out

rm -f *.o raxmlHPC
gcc  -D_GNU_SOURCE -fomit-frame-pointer -funroll-loops -O2 -msse    -c -o axml.o axml.c
make: gcc: Command not found
make: *** [<builtin>: axml.o] Error 127

Solution #1.b: Install gcc dependency

sudo apt-get install gcc

Problem #1.c ... but Sometimes "gcc" is not found by python, but could not be installed directly until some dependencies are updated. That's the reason for executing the first two lines. Then runs smoothly. Sometimes is neccesary to run a couple of times to update them, mirrors where they download sometimes had a problem.

Ign:1 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 linux-libc-dev amd64 5.4.0-139.156
Err:1 http://security.ubuntu.com/ubuntu focal-updates/main amd64 linux-libc-dev amd64 5.4.0-139.156
  404  Not Found [IP: 91.189.91.39 80]
E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/l/linux/linux-libc-dev_5.4.0-139.156_amd64.deb  404  Not Found [IP: 91.189.91.39 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?

Solution #1.c:

sudo apt-get update
sudo apt-get install build-essential

Now, you could solve problem 1.b and 1.a to run the command make -f Makefile.gcc. There are different version for Makefile command like SSE3 or SSE3.PTHREADS you could try.

Problem #2: When copying the raxmlHPC* files appears

cp: target '/yourfolder/' is not a directory

Solution #2: The directory need to include "~/" before and close with "/"

cp raxmlHPC* ~/argo/

_Problem #3:

Command 'raxmlHPC' not found, but can be installed with: sudo apt install raxml

Solution #3: Add "./" before running (not included in the RAXMl tutorial)

./raxmlHPC -m BINGAMMA -p 12345 -s output.phy -n T1

Problem #4: When use ./raxmlHPC -m BINGAMMA -p 12345 -s output.phy -n T1

ERROR: Bad base (A) at site 1 of sequence 1
Printing error context:

30 7955
nc_015248 AGAATATATAAAAAATTATAT

Solution #4: BINGAMMA is designed for protein alignments, for DNA use GTRCATX. Also commented in the RAxML hands-on session, in the "Step 3: Getting started".

Problem #5: permission denied

(raxml) w@PC-i9:~/mitoanalysis/raxml$ ./raxmlHPC -m GTRGAMMA -p 12345 -s mito-cc.phy -n T21
-bash: ./raxmlHPC: Permission denied

Solution #5: Change directory permission to "executable" with chmod

(raxml) w@PC-i9:~/mitoanalysis/raxml$ chmod +x raxmlHPC
(raxml) w@PC-i9:~/mitoanalysis/raxml$ chmod +x raxmlHPC-PTHREADS-SSE3
(raxml) w@PC-i9:~/mitoanalysis/raxml$ chmod +x raxmlHPC-SSE3

Visualize phylogeny

FigTree http://tree.bio.ed.ac.uk/software/figtree/ is a good alternative to visualize the resulting phylogeny from the RAXMl. You could change label names in the phylogeny.

sed -i 's/code1/long_name1/g' yourfilename.T1;

Bayesian Analysis - BEAST

Sources

For dos2unix:

Basic bash use:

For Conda environments:

Text files / fasta files editing in Terminal:

For RAXMl:

Extra Rename file names

rename 's/(\w+).+/$1.fasta/' *.fasta