View on GitHub


UCE Bioinformatics Pipeline

NOTE: this repository was intially created by Wilson X Guillory wxgillo.


This is a tutorial for the phylogenomic workflow used by the Brown lab, where we use UCEs to uncover evolutionary histories, mostly in Neotropical poison frogs (Dendrobatidae). In this tutorial I provide sample data and take you through the steps of read processing, sequence assembly, read-to-locus matching, and sequence alignment, and finally provide a few examples of phylogenetic analyses that can be performed on UCE data.


Directory structure and example files

In this tutorial I will be using a Linux machine (named Bender) for all steps. We need to start by creating a directory to put the example data in.
In my examples I will be using Bash commands on the Linux command line (Terminal), but many of the mundane commands (switching directories, moving files, creating directories), can also be completed simply using the desktop interface. In the following suite of Bash commands, we will move to the desktop, create a folder tutorial from which will be working, go into tutorial, and create a subfolder 1_raw-fastq, in which we will place our raw read data.

cd ~/Desktop  
mkdir tutorial
cd tutorial
mkdir 1_raw-fastq

Subsequent steps will be conducted in numbered folders following the 1, 2, 3_… numbering scheme, which will make it easier to keep track of what’s going on.

Now we need to move the provided example files into tutorial. Download the following twelve .fastq files from here and place them into 1_raw-fastq. (Probably easiest to do this via desktop). When you run the following command:

 ls 1_raw-fastq/

You should see the following:


These twelve files correspond to six samples:

A bit about the organization and naming of our samples: They are organized into “plates”, where each was a batch of samples sent to RAPiD Genomics (Gainesville, FL). Each plate folder contains two .fastq files for each sample in that plate, an info.txt file that contains the adapters, and a SampleSheet.csv that contains valuable information on each sample, including connecting the RAPiD Genomics sample name with the Brown lab sample name.
The way sample names work is that there is generally a truncated version of the species name (Apete in the first sample above, short for Ameerega petersi), followed by a collection number (JLB07-001, this stands for Jason L. Brown, 2007 trip, sample 001), and then followed by a two part unique ID code (0008-AAAI). The truncated species name is sometimes unreliable, especially for cryptic species such as A. hahneli. Generally the unique ID code is the most searchable way to find each sample’s info in a spreadsheet. The numerical and alphabetical components of the code are basically the same, so either is searchable by itself (for instance, you can search “0008” or “AAAI” with the same degree of success).

Note that Plate 2 is a bit different than the others in that there are four .fastq files for each sample rather than two. Basically, an additional sequencing run was necessary. We combined both runs for each lane for each sample and put them into the folder combo that is located inside the plate2 folder. Use those files.

Also place the various files in the example-files folder in this repository directly into the tutorial folder on your computer (not a subfolder).

Using miniconda2

On my computer I’ve installed Phyluce (the suite of programs we’ll use to do most tasks) via Miniconda2, a package manager frequently used by computational biologists. After installing Miniconda2, you can use the command conda install phyluce to install Phyluce and all dependencies (including Illumiprocessor, which is the first part of the pipeline). However, I’ve added an extra step by installing Phyluce in its own environment, using the command conda create --name phyluce phyluce. You need to activate the environment before you can use any of the Phyluce commands. Go ahead and do this with:

conda activate phyluce

You should notice a (phyluce) modifier appear before your command prompt in Terminal now. If you want to leave the Phyluce environment, run:

conda deactivate

Read trimming

The first real step in the process is to trim the raw reads with Illumiprocessor, a wrapper for the Trimmomatic package (make sure to cite both). Illumiprocessor trims adapter contamination and low quality bases from the raw read data. It runs fairly quickly, but be warned that this is generally one of the most onerous steps in the whole workflow, because getting it to run in the first place can be fairly challenging. One of the main difficulties comes in making the configuration file, which tells Illumiprocessor what samples to process and how to rename them.

Making the Illumiprocessor configuration file

Since we are only using twelve samples (note that there are two .fastq files per sample), I simply wrote the configuration file by hand without too much difficulty. In cases where you want to process more samples, though, you will want to streamline the process. More on this later.

First, I will explain the structure of the configuration file. The file consists of four sections, each identified by four headers surrounded by [square brackets]. The first section is listed under the [adapters] header. This simply lists the adapters. They should be the same for all samples. You can get the adapters from the readme file in each plate folder (the name is consistent). The section should look like this:


This is generally the easiest part. The i7 adapter is the “right” side (if oriented 5’ to 3’), and the i5 is the “left”.

The second section is under the [tag sequences] header. This lists the sequence tags used for each sample. We are using dual-indexed libraries, so there will be two tags per sample, one corresponding to the i7 adapter and the other corresponding to i5. This section will look like this:

[tag sequences]

Note the structure: to the left of the colon is the tag name (must start with either i5 or i7, and should be unique to the sample), and to the right of the colon is the tag sequence. The sequences can be obtained from the SampleSheet.csv. The tag names are up to you.

The third section is under the [tag map] header. This connects each sequence tag to a particular sample (remember, there are two per sample). It should look something like this:

[tag map]

The structure is the sample name, separated by a colon from the two corresponding tag names (which themselves are separated by commas).
Let’s talk about the sample name, because it looks like a bunch of gobbledy-gook. This sample name is the one returned by RAPiD Genomics, and is only connected to a particular sample via the SampleSheet.csv file. In this case, “RAPiD-Genomics_HJYMTBBXX_SIU_115401_P01_WG01” is the same sample as “ApeteJLB07-001-0008-AAAI”. Basically all of it can be ignored except for the last two bits (for this sample, “P01_WG01”). That first bit is the plate the sample was sequenced in, and the second bit is the unique identifier for the sample. Note that different plates will use the same identifiers, so including the plate is important (notice that we have both P01_WG01 and P04_WG01). Also note that this sample name corresponds to the actual filenames of the two corresponding .fastq files (remember, there are two .fastq files per sample). However, it is partially truncated, only going as far as the unique ID of the sample. The full filename for this particular sample is RAPiD-Genomics_HJYMTBBXX_SIU_115401_P01_WG01_i5-512_i7-84_S1152_L006_R2_001.fastq.gz, while in our configuration file we only include the name up to the WG01 part. It is unnecessary to include the rest.

The fourth and final section is under the [names] header. It connects the nigh-unreadable RAPiD sample name to a more human-readable name of your choice. The name can be whatever you want, but we’re going to make it the actual Brown Lab sample name. It should look like this:


The structure is the RAPiD sample name, separated by a colon from the Brown Lab sample name. Note that the first part is identical to the first part of the previous section; we are just renaming samples to which we already assigned sequence tags. Remember, the human-readable sample names are also obtained from the SampleSheet.csv file.

An important note is that this is the stage that decides your sample names for the rest of the process, so be careful. A few things that you must make sure of are the following:

Historically I have found that spaces and underscores are in a few sample names that may make their way into your file. It is often good to ctrl+F and search for these, and replace them with - dashes.

Making the configuration file quickly with Excel

For large numbers of samples, making the configuration file by hand can be taxing. For six samples, it took me about twenty minutes to gather all of the various data from different sample sheets and organize it (the data being from different plates did not help). For this reason, we have made an Excel spreadsheet in which you can copy and paste relevant information directly from the sample sheets for large numbers of samples, and it will automatically organize the information in the correctly formatted manner for the configuration file. I have provided a file named UCE_cleanup.xlsx in the example-files directory of this repository that you can use as a template and example.

Essentially the first step is to copy and paste the entire sample sheet, wholesale, into the first sheet of the file. Now you can directly access the various components and combine them in various ways to form the Illumiprocessor configuration file. The tagi5 sheet creates the i5 portion of the [tag sequences] section, and the tagi7 sheet creates the i7 portion. The tag map sheet creates both the [tag map] and [names] sections of the file. This may require a bit of finagling, because the sample tag (the [Name for Tag Map] column) needs to be truncated from the full filename (the [Sequence_Name] column). Samples from different plates will usually need to be truncated a different amount of characters, so you may need to modify the equation in the [Name for Tag Map] column to suit your needs. Note that in this example, we did not truncate to the unique ID (WH01, WH02, etc.), including a bit more information; this is not really necessary. This step can be a source of downstream errors because you may have truncated more or less than you thought you did, creating an inconsistent tag map section.

Just copy the different sections of the spreadsheet into a text file, and your configuration should be ready to go.

Running Illumiprocessor

This is the part where things generally go horribly, woefully wrong. I don’t think I’ve ever run this command and had it work the first time. That’s because the configuration file needs to be exactly, perfectly correct, which is difficult to do with large numbers of samples. I include a “troubleshooting” section after this one that lists some of the common problems.

Here is the command to be used in this particular case:

illumiprocessor \
    --input 1_raw-fastq \
    --output 2_clean-fastq \
    --trimmomatic ~/Desktop/BioTools/Trimmomatic-0.32/trimmomatic-0.32.jar \
    --config illumiprocessor.conf \
    --r1 _R1 \
    --r2 _R2 \
    --cores 19

Update JLB 8/2020: Here is another version of this command that works if you are getting “Errno 8- Exec format error” after executing CHMOD. Note that trimmomatic is already installed within Phyluce. The only reason to use the above code specifying Trimmomatic is if you do not want to updated Phyluce and want to use the latest version of Trimmomatic (check version by going to “Home\anaconda2\envs\phyluce\share\trimmomatic-0.3XX", where XX is the version number).

illumiprocessor \
    --input 1_raw-fastq \
    --output 2_clean-fastq \
    --config illumiprocessor.conf \
    --r1 _R1 \
    --r2 _R2 \
    --cores 4

Note that a \ backslash in Bash “escapes” the next character. In this particular command, the backslashes are escaping the invisible \n newline character, so that each argument can be written on a separate line for visual clarity. The entire command is generally written on a single line, but this becomes hard to read as arguments are added.

JLB Note: If you have any error and are prompted to input “Y/n”, remember “Y” is capitalized and “n” is lower case —else the command will error out.

You will see that most of the commands we use will be structured in this way. Here is how the command is structured (future commands will not be explained in such detail):

Hopefully, when you enter the command into terminal (make sure you are in the tutorial main directory and not a subfolder), it says “Running” rather than an error message. This process took my computer only a few minutes to run, but adding samples or using fewer cores will require longer times.

You should now have a folder in your tutorial directory named 2_clean-fastq. Running the following command:

ls 2_clean-fastq

should yield the following list of samples:

AbassJB010n1-0182-ABIC      AflavMTR19670-0522-AFCC   ApeteJLB07-001-0008-AAAI
AbassJLB07-740-1-0189-ABIJ  AhahnJLB17-087-0586-AFIG  AtrivJMP26720-0524-AFCE

Note that there are now 6 files rather than 12 (1 per sample rather than 2), and that the samples have been renamed to match Brown Lab names. You can actually tell what they are now!

Troubleshooting Illumiprocessor

Sequence assembly is basically the process of putting your raw sequence data into a format at least reminiscent of the way the DNA was organized in life. Imagine that you have one hundred copies of a book, and you put them all in a paper shredder. Now you have to reconstruct the book from the shredded chunks. Since you have multiple copies, not all of which were shredded in the same way, you can find chunks that partially match and use these matches to connect to the next sentence. Now, however, imagine that the book contains several pages that are just “All work and no play makes Jack a dull boy” over and over. This makes the process much more difficult because this construction causes extreme ambiguity. This is an analogy to how assembly works, and also how repeating DNA elements make assembly difficult. This is why many computational biologists have made their careers based on writing powerful de novo assembly programs.

PHYLUCE can assemble sequences using the programs velvet, ABySS, and Trinity. Out of these, ABySS is probably regarded as “the best” in terms of assembly accuracy, but Faircloth recommends using Trinity out of a combination of good speed, and providing longer contigs. I have used Trinity in all of my Brown Lab projects and will use it in this tutorial.

Just like with Illumiprocessor, the first thing we need to do is… make a configuration file!

Making the assembly configuration file

Compared to the Illumiprocessor config file, the assembly one is very easy to make (and can be downloaded as assembly.conf from the example-files directory in this repository). You can make it with a simple Bash script:

cd 2_clean-fastq
echo "[samples]" > ../assembly.conf
for i in *; \
   do echo $i":/home/bender/Desktop/tutorial/2_clean-fastq/"$i"/split-adapter-quality-trimmed/"; \
   done >> ../assembly.conf

JLB Note 8/2020: Be sure to change your computer name. In the code above it is ‘bender’, you can find this by looking at Terminal window- in the green text before the “@” is the computer name. Also be sure to check the assembly.conf file. Sometimes the last line is not correct and will make it appear that the code ran incorrectly.

The first command moves you into the 2_clean-fastq directory, and the second initiates a text file with the [samples] header. The third command is a for loop that prints out the name of each sample (represented by $i), followed by a colon and lastly the file path to the split-adapter-quality-trimmed folder inside of each sample’s individual folder. The split-adapter-quality-trimmed folders contain .fastq.gz files with the trimmed reads. The assembler needs to know the location of each of these folders for each sample.

Bash tips:

After running the command, it’s usually a good idea to manually check the file to ensure that you have only the [samples] header and the sample paths listed. If there were any other files in the 2_clean-fastq folder, they would be listed in this file as well and need to be removed. The file should look like this:


You should go back into the tutorial directory after this. Use:

cd ..

Running Trinity to assemble cleaned reads

With the assembly configuration file completed, we can now run Trinity. Use the following PHYLUCE command:

phyluce_assembly_assemblo_trinity \
    --conf assembly.conf \
    --output 3_trinity-assemblies \
    --clean \
    --cores 19

Hopefully your run works the first time. This is generally one of the longest-duration steps in the pipeline - each assembly generally takes an hour to complete with 19 cores. For a set of 96 samples, this process can take most of a working week. I like to run it over a weekend. For these six samples, the run took about four and a half hours with 19 cores.

When the assemblies have finished, you should have a folder called 3_trinity-assemblies in your tutorial folder. Using the command:

ls 3_trinity-assemblies

should display:

AbassJB010n1-0182-ABIC_trinity      AhahnJLB17-087-0586-AFIG_trinity  contigs
AbassJLB07-740-1-0189-ABIJ_trinity  ApeteJLB07-001-0008-AAAI_trinity
AflavMTR19670-0522-AFCC_trinity     AtrivJMP26720-0524-AFCE_trinity

JLB Note 8/2020: This is the longest step. I suggest for going through this tutorial that you run only 1 or 2 samples. To do this, edit your ‘conf.assembly’ file in basic text editor amd remove all but a few samples. If - or when - this works for you with no issues, I then suggest you download the other assembled files from here for use for the remaining tutorial. Please use the files you created*, only copying the missing files here. *this will help us troubleshoot any issues you may encounter during this step - sometimes they are not obvious

The assembly has generated a set of six folders (one per sample) as well as a folder named contigs. Inside each sample folder, you will find a Trinity.fasta file that contains the assembly, as well as a contigs.fasta link that links to that .fasta file. The contigs folder further contains links to each sample’s .fasta file.

Troubleshooting Trinity

Generally, the most common error with Trinity will generally be caused by specifying incorrect file paths in your configuration file. Double-check them to make sure they’re correct.

Other possible issues can arise if you copied the trimmed reads over from another directory without preserving folder structure. This can break the symlinks (symbolic links) that are in the raw-reads subfolder of each sample’s folder. They need to be replaced for Trinity to function. You can do that with the following Bash commands:

cd 2_clean-fastq
echo "-READ1.fastq.gz" > reads.txt
echo "-READ2.fastq.gz" >> reads.txt
ls > taxa.txt
for j in $(cat reads.txt); \
   do for i in $(cat taxa.txt); \
         do ln -s $i/split-adapter-quality-trimmed/$i$j $i/raw-reads/$i$j; \
         done; \
cd ..

This creates two files: the first, reads.txt, contains a list of two file endings that will be looped over to construct proper symlink names; the second, taxa.txt, contains a simple list of all of the samples in the 2_clean-fastq folder. The next command is a set of two nested for loops that basically reads as “For both file endings listed in reads.txt, and then for every sample listed in taxa.txt, generate a symlink in the raw-reads directory for that sample that leads to the corresponding .fastq.gz file in the split-adapter-quality-trimmed folder for this sample”.

JLB Notes 8/2020: I received the error ‘jellyfish: not found’, I solved this by reinstalling Jellyfish using this code: ‘conda install jellyfish=2.2.3’.

JLB Notes 10/2020: Today date my most rage inducing issue - aside from switch java bullshit [see bottom]: ‘Error: gzip: stdout: Broken pipe’ was found in the log files of a Trinity for each sample. Turns out this is a known Trinity bug - the temp fix is remove the ‘singleton’ reads from each of the ‘split-adapter-quality-trimmed’ folders in folder ‘2_clean-fastq’. For details see.

More Bash tips:

The loop will loop through every file ending in .fasta located in the 3_trinity-assemblies/contigs folder, and then process it using the phyluce_assembly_get_fasta_lengths script. (Note that these are not actual .fasta files, but links to them.)

In order listed, the summary stats printed to assembly_stats.csv will be:

sample,contigs,total bp,mean length,95 CI length,min,max,median,contigs >1kb

Locus matching

The next step is going to be a sequence of PHYLUCE commands that essentially takes your assemblies, finds the UCEs inside of them, and then conveniently packages them so that you can later align them.

Matching contigs to probes

The first of these commands is phyluce_assembly_match_contigs_to_probes, which matches your contigs to the set of UCE probes used during sequencing. If you haven’t already, download the uce-5k-probes.fasta file from the example-files directory of this repository and put it in tutorial. This .fasta file contains the sequences of these UCE probes. The command is as follows:

phyluce_assembly_match_contigs_to_probes \
    --contigs 3_trinity-assemblies/contigs \
    --probes uce-5k-probes.fasta \
    --output 4_uce-search-results

For a lot of samples, this command can last long enough to give you a coffee break. For our six samples, it should take only a few seconds. The output will be located in the new folder 4_uce-search-results. Inside it, you will find six .lastz files, which is another sequence storage format. There will also be a probe.matches.sqlite file, which is a database relating each contig to each probe.

If you have issues getting this command to work, it’s likely that it’s because you copied assemblies over from another directory, which breaks the links in the contigs folder of 3_trinity-assemblies. You can resurrect these links using the ln -s command and a for loop, or by using it individually for particular samples. You need to link to the Trinity.fasta file in that particular sample’s assembly directory. You will also need to remake links if you alter a sample’s name, say if you forgot to remove periods from names or something like that.

Extracting UCE locus data

The next portion takes the matched contigs/probes and extracts usable sequence data in .fasta format for each sample, organized by UCE locus. There is a bit of setup we have to do first.

Creating taxon sets

The first thing we have to do before we move on is create one or more “taxon sets”. By this I mean a set of samples that you would like to work on. For us, this is just going to be the complete set of six samples we’ve been processing this whole time. But for other projects, you may wish to use a specific subset of samples at times, and the whole set at others. For example, in my own UCE projects I often have a taxon set that is the full set of samples, as well as another, smaller, one with one representative sample per species.

To create a taxon set, we need to create yet another configuration file. Don’t worry, this one is simple. The file just needs a list of the samples to be used, underneath a header in [square brackets] that gives the taxon set a name. Below is a block of code that creates a file called taxa.txt, which contains a taxon set called all that contains all six samples.

echo "[all]" > taxa.txt
ls 4_uce-search-results >> taxa.txt

This is a very crude script because we do have to go in and edit the file. It wouldn’t be too hard to write a script that makes the file without any other intervention, but it’s not hard to make the edits manually. If you open up the file in a program like gedit, it should look like this:


Notice the [all] header that gives the taxon set its name. The two things we need to do are:

The final taxa.txt file should look like this:


If you have more than one taxon set, you can put all of them in the same file, or put them in separate files.

Next we need to create a file structure, basically a folder for each taxon set. We’ll be doing most of the remainder of our work in that folder. Use the following command:

mkdir -p 5_taxon-sets/all

The mkdir command simply makes a directory. The -p flag allows you to make nested directories. We now have a directory named 5_taxon-sets that contains another directory called all that corresponds to our [all] taxon set. If we had more than one taxon set in our taxa.txt file, we would create a directory for each of them inside 5_taxon-sets.

Getting .fasta files for each sample and UCE locus

Next up is a set of commands that takes our .lastz files and .sqlite database, and turns them into .fasta files for each UCE locus and each sample in our taxon set. The first command is phyluce_assembly_get_match_counts, used below:

phyluce_assembly_get_match_counts \
    --locus-db 4_uce-search-results/probe.matches.sqlite \
    --taxon-list-config taxa.txt \
    --taxon-group 'all' \
    --incomplete-matrix \
    --output 5_taxon-sets/all/all-taxa-incomplete.conf

This command generates a configuration file, all-taxa-incomplete.conf, that is located in 5_taxon-sets/all.
Now we’re going to go into 5_taxon-sets/all and make a log directory:

cd 5_taxon-sets/all
mkdir log

Then we use the following command

phyluce_assembly_get_fastas_from_match_counts \
    --contigs ../../3_trinity-assemblies/contigs \
    --locus-db ../../4_uce-search-results/probe.matches.sqlite \
    --match-count-output all-taxa-incomplete.conf \
    --output all-taxa-incomplete.fasta \
    --incomplete-matrix all-taxa-incomplete.incomplete \
    --log-path log

This is a good command to run over your lunch break if you have a lot of samples. For us, it should only take about a minute or two. This command generates a big .fasta file, all-taxa-incomplete.fasta, that contains each sample’s set of UCE loci. I recommend against attempting to open the file in a GUI program to view it, as it is probably so big that it’ll lock up your computer. You can use less -S all-taxa-incomplete.fasta to view it seamlessly in Terminal.

If you look at the to-screen output of the previous command, it will tell you how many UCE loci were recovered for each sample. We targeted around 5,000 loci in total, but for most samples we retain ~1,500. This is pretty normal for our poison frog samples.

Getting summary statistics for our UCE loci

We can use a few commands to look at summary stats pertaining to the UCE loci for each sample. First we need to “explode” the huge .fasta file we generated in the previous step to create a separate .fasta file for each sample.

phyluce_assembly_explode_get_fastas_file \
    --input all-taxa-incomplete.fasta \
    --output exploded-fastas \

This generates a folder exploded-fastas that contains six .fasta files, one for each sample, containing the (unaligned) UCE loci for that sample. Next use this command to generate summary stats:

for i in exploded-fastas/*.fasta;
    phyluce_assembly_get_fasta_lengths --input $i --csv;

This is the same command that you might have used to get assembly summary statistics earlier. Like that case, the output will be organized as a .csv file with the following columns:

sample,contigs,total bp,mean length,95 CI length,min,max,median,contigs >1kb

Sequence alignment

The final step before we get to the phylogenetic analysis is sequence alignment. Like sequence assembly, alignment is a classic problem in bioinformatics that has built the careers of several computational biologists. Basically, we need to “align” DNA sequences (think of strings of ATGCGCGTACG… etc.) for each of our samples, so that homologous base pairs are located in the same column. This is a difficult problem because divergent taxa can have very different sequences even for the same gene, making the question of “are these base pairs actually homologous?” fairly nebulous. The desired end results of sequence alignment is a matrix where every column corresponds exactly to homologous base pairs for each taxon (which are represented by rows). The phylogenetics program you use will compare the sequences in the alignment to determine their evolutionary relationships. More divergent sequences should lead to those species being further removed from each other in the phylogeny.

PHYLUCE is integrated with the alignment programs MAFFT and MUSCLE. Faircloth recommends using MAFFT, but it’s mostly a matter of personal preference. As a creature of habit, I always tend to use MUSCLE, and will be using that program in this tutorial. Also note that PHYLUCE will perform edge-trimming of alignments (basically an alignment cleaning step) unless otherwise specified. Faircloth recommends this for “closely-related taxa” (<30-50 Ma), which our poison frogs are, so we will allow the edge-trimming.

To perform per-locus alignments, use the following command:

phyluce_align_seqcap_align \
    --fasta all-taxa-incomplete.fasta \
    --output-format nexus \
    --output muscle-nexus \
    --taxa 6 \
    --aligner muscle \
    --cores 19 \
    --incomplete-matrix \
    --log-path log

This command creates a folder called muscle-nexus that has individual per-locus .fasta files, each containing a sequence alignment for that locus. You can easily check how many loci you have by checking how many .fasta files are in this folder; in our case, we have 2,019 (what a coincidence).

If all of your alignments are getting dropped (I’ve had this problem with newer versions of Phyluce), it’s likely because you have edge-trimming turned on. Use the following command instead:

phyluce_align_seqcap_align \
    --fasta all-taxa-incomplete.fasta \
    --output-format nexus \
    --taxa 6 \
    --output muscle-nexus \
    --aligner muscle \
    --cores 19 \
    --no-trim \
    --incomplete-matrix \
    --log-path log

Then, you can trim the alignments separately with the command:

phyluce_align_get_trimal_trimmed_alignments_from_untrimmed \
	--alignments muscle-nexus \
	--output muscle-nexus-trimmed \
	--input-format nexus \
	--output-format nexus

Twomey Pipeline

Twomey Data Polishing Pipeline - OPTIONAL, but highly encouraged

This pipeline takes your current results and realligns them to concencus sequences generated from all your input sequences. The need for this arrose when Evan Twomey was checking loci-level allignments and he noted several taxa where mis-alligned at the end.

All scripts referenced here are in the folder ‘twomey_scripts’. Besure to open each script before running to make sure your directories match the script directories. If you do not check this- the scripts are guaranteed to fail.

Additonal encouragement from Evan Twomey:

Some reasons to go through all this hassle:

– This was the only thing that solved my issue with conspicuously long branch lengths for certain terminals. This is because if you think of the reference sequence as a “good” sequence with minimal or no artifacts, then only well-matching reads will align and be used to extract the sequence. Reads or parts of reads that do not align will not make it into the sequence.

– You don’t need to have full contigs for this to work, which is a big advantage. For example, if a particular individual only had spotty read coverage for a certain locus, which would preclude assembling a full-length contig, that is no problem here because individual reads can align to a reference with no requirement for contiguity.

– You can easily change the reference sequence to recover off-target sequences. This is now how I do mitochondrial assembly: I just use a mitochondrial genome from a close relative for the reference sequence instead of the UCE loci. There are also lots of nuclear loci (e.g. 28S) that can be reliably recovered across samples.

Software Dependencies:

’'’We specify the version used, but it probably doesn’t matter. If you hit any problems with any steps though, check the versions. EMBOSS ( install EMBOSS

Hisat2 (2.1.0) conda install hisat2

Samtools (0.1.19)

Seqtk (1.2) conda install seqtk

Muscle (3.8)

AMAS ( conda install AMAS and

sudo apt-get install python3
pip install amas

IMPORTANT - pay particular attendtion to where this script is installed. For Bender it is located in: “/home/bender/miniconda2/envs/phyluce/lib/python2.7/site-packages/amas/”

TrimAl (1.4)

Angsd (x): conda install -c bioconda angsd

Twomey Step 1. Make a reference for read-alignment

General goal here is to extract a consensus sequence for each UCE locus, and turn these into a reference fasta for read alignment.

1) Run the emboss_UCE_consensus script in the directory containing all your UCE loci. These should be the loci that are aligned but not filtered, because even if a UCE was recovered only for a single sample, it could be useful for read alignment. Basically don’t throw out any loci at this stage.

If you follow the phyluce pipeline, it should be this folder: “taxon-sets/all/mafft-fasta-internal-trimmed-gblocks-clean”

Then put the script in that folder and run it.


To run type:


This should spit out a new folder called ‘consensus’, where each locus is represented by a single sequence.

2) Concatenate these to a single fasta file (change to newly created consensus folder)

cat *.fasta > uce_consensus_reference.fasta

3) Index this file with hisat2 to turn it into a reference for read alignment

hisat2-build uce_consensus_reference.fasta uce_consensus_index -p 6

This will make something like eight ht2 files. You’ll need these for the next steps.

Twomey Step 2. Organize a directory structure

Make the following directories in whatever work directory you want: work_directory/fastas – This will hold the per-sample read alignment results. Empty for now.

mkdir -p work_directory/fastas

‘work_directory/reference_sets/UCE’ – This will hold your hisat2 index files and the uce_consensus_reference.fasta

mkdir -p work_directory/reference_sets/UCE

‘work_directory/samples’ – This holds your folders corresponding to each sample to be included in the phylogeny. To do this, go to the ‘clean-fastq’ folder of one of your phyluce runs. Copy (or link) these folders into this samples directory.

mkdir -p work_directory/reference_sets
mkdir -p work_directory/angsd_bams

For example, for the samples directory, the path to your sample reads would be something like: work_directory/samples/amazonica_Iquitos_JLB08_0264/split-adapter-quality-trimmed work/directory/samples/benedicta_Pampa_Hermosa_0050/split-adapter-quality-trimmed etc.

Twomey Step 3. Align per-sample reads to reference and extract sequence for each locus

All you have to do here is place the script into the work_directory/samples folder, and run it. Results will appear in work_directory/fastas as each sample finishes. However, you need to be sure that the directories are correct. I like to use full paths to minimize ambiguity. Here’s the whole script, with some comments:

Note- go to folder ‘work_directory/samples’

First index your UCE file

bwa index /home/jason/Desktop/tutorial/work_directory/reference_sets/UCE/uce_consensus_reference.fasta

Now run the bash script “”- before you run it make sure you change the directories inside of the script.


JLB Note 2/2021: If you error out - in the script “bash”, change: samtools sort $sample.bam -o $sample.sorted.bam to samtools sort $sample.bam $sample.sorted

Next run the script ‘’

Go to ‘angsd_bams’ folder

To run script type:


Twomey Step 4: Rearrange resulting fasta files into an alignment```

If the above steps worked correctly, your work_directory/‘angsd_bams’ or copy to work_directory/‘fasta’ folder should be filled with a fasta file for each sample. This step is to make these files into something useable for phylogenetics.

1) First, concatenate all these resulting fasta files. Do this into a new subdirectory:

mkdir explode
cat *fasta > explode/cat.fasta

2) Enter the ‘explode’ directory. The cat.fasta needs to be exploded so that the fastas are arranged by locus rather than by sample:

awk '/^>/{split($1,a,"[|.]")}{print >> a[2]".fa"}' cat.fasta

IMPORTANT. If this didn’t create 1000s of files, re-run this step with a slightly altered code (removing “.” from search after pipe):

awk '/^>/{split($1,a,"[|]")}{print >> a[2]".fa"}' cat.fasta
3) Reformat headers. They currently look like: > sample_locality_001 uce-con_1 and need to lose the locus ID. Run this command on the directory of by-locus .fa files (will be thousands of files)
sed -i 's/|uce-.*//' *.fa

4) Run the script in the same directory as all these .fa files. This will pad the lociso that they are all the same length (adding dashes to the sequences to make the lengths correct), so that they are recognized as alignments.

JLB 12.2020: Note - make sure all files in this folder that are *.fa and *.fasta extensions are desired - I had a rouge .fa file (“bam.fa”) other .fasta files in my folder that created a ton of confusion. The files you want shoud be named to match UCE name. I recommend creating a new folder or removing all other files from this directory (minus the .sh files)


5) Even though these loci are already aligned (because the sequences were extracted from reads that aligned to a reference), I have found that re-aligning each locus with Muscle can fix some minor alignment issues (I think this has to do with indels).

Next step, run the muscle_loop script. Navigate up to the’seqret_output’ folder and copy the ‘’ script to this folder. The ‘’ script can take a while, like a few hours.

Now create a new output folder:

mkdir muscle

Then run the script:


6) Concatenate loci with AMAS. To do this, enter the newly created ‘muscle’ directory and then run this script:

python  /home/bender/miniconda2/envs/phyluce/lib/python2.7/site-packages/amas/ concat -i *.fasta -f fasta -d dna

JLB 12.2020: Note the specific location of “” (found when installing it see above)

7) The alignment will have a bunch of question marks in it. I guess these are from the read alignments. I replace them with – so that the alignments can be cleaned up with trimal (trimal will not touch question marks). It is possible this is a bad idea.

sed -i 's/?/-/g' concatenated.out

8) Last step is to clean up the alignments with trimal to remove columns composed of mostly missing data. The gappyout option gives an alignment roughly twice as long as the no_gaps option; in my experience the trees come out identical but with support values slightly higher when using gappyout.

trimal -in concatenated.out -out trimal_gappyout.fasta -gappyout

This alignment should now be ready to use (e.g., IQ-Tree). However, I suggest you jump to locus filtering from this step.

End of Twomey Pipeline

Phasing Haplotypes

(work in progress)

Phasing is the process of inferring haplotypes from genotype data. This can be very useful for population-level analyses. We are phasing based on the pipeline The following workflow derives from Andermann et al. 2018 ( and the Phyluce pipeline (

To phase your UCE data, you need to have individual-specific “reference” contigs against which to align your raw reads. Generally speaking, you can create these individual-specific reference contigs at several stages of the phyluce_ pipeline, and the stage at which you choose to do this may depend on the analyses that you are running. That said, I think that the best way to proceed uses edge-trimmed exploded alignments as your reference contigs, aligns raw reads to those, and uses the exploded alignments and raw reads to phase your data.

.. attention:: We have not implemented code that you can use if you are trimming your alignment data with some other approach (e.g. gblocks_ or trimal_).

Exploding aligned and trimmed UCE sequences

Probably the best way to proceed (you can come up with other ways to do this) is to choose loci that have already been aligned and edge-trimmed as the basis for SNP calling and haplotype phasing. The benefit of this approach is that the individual-specific reference contigs you are inputting to the process will be somewhat normalized across all of your individuals because you have already generated alignments from all of your UCE loci and trimmed the edges of these loci.

To follow this approach, first proceed through the :ref:EdgeTrimming section of :ref:TutorialOne. Then, you can “explode” the directory of alignments you have generated to create separate FASTA files for each individual using the following (this assumes your alignments are in mafft-nexus-edge-trimmed as in the tutorial).

.. code-block:: bash

# explode the alignment files in mafft-nexus-edge-trimmed by taxon create a taxon-specific FASTA
phyluce_align_explode_alignments \
    --alignments mafft-nexus-edge-trimmed \
    --input-format nexus \
    --output mafft-nexus-edge-trimmed-exploded \

The current directory structure should look like (I’ve collapsed a number of branches in the tree):

.. code-block:: bash

├── assembly.conf
├── taxon-sets
│       └── all
│           ├── all-taxa-incomplete.conf
│           ├── all-taxa-incomplete.fasta
│           ├── all-taxa-incomplete.incomplete
│           ├── exploded-fastas
│           ├── log
│           ├── mafft-nexus-edge-trimmed
|           └── mafft-nexus-edge-trimmed-exploded
|               ├── alligator_mississippiensis.fasta
|               ├── anolis_carolinensis.fasta
|               ├── gallus_gallus.fasta
|               └── mus_musculus.fasta
└── uce-search-results

You may want to get stats on these exploded-fastas by running something like the following:

.. code-block:: bash

# get summary stats on the FASTAS
for i in mafft-nexus-edge-trimmed-exploded/*.fasta;
    phyluce_assembly_get_fasta_lengths --input $i --csv;

Creating a re-alignment configuration file

Before aligning raw reads back to these reference contigs using bwa, you have to create a configuration file, which tells the program where the cleaned and trimmed fastq reads are stored for each sample and where to find the reference FASTA file for each sample. The configuration file should look like in the following example and should be saved as e.g. phasing.conf

.. code-block:: text




[references] ^^^^^^^^^^^^

In this section you simply state the sample ID (genus_species1) followed by a colon (:) and the full path to the sample-specific FASTA library which was generated in the previous step.

[individuals] ^^^^^^^^^^^^^

In this section you give the complete path to the cleaned and trimmed reads folder for each sample.

.. attention:: The cleaned reads used by this program should be generated by illumiprocessor_ because the folder structure of the cleaned reads files is assumed to be that of illumiprocessor_ . This means that the zipped fastq files (fastq.gz) have to be located in a subfolder with the name split-adapter-quality-trimmed within each sample-specific folder.

[flowcell] ^^^^^^^^^^

The flowcell section is meant to add flowcell information from the Illumina run to the header to the BAM file that is created. This can be helpful for later identication of sample and run information. If you do not know the flowcell information for the data you are processing, you can look inside of the fastq.gz file using a program like less. The flowcell identifier is the set of digits and numbers after the 2nd colon.

.. code-block:: bash

@J00138:149:HT23LBBXX:8:1101:5589:1015 1:N:0:ATAAGGCG+CATACCAC ^^^^^^^^^

Alternatively, you can enter any string of information here (no spaces) that you would like to help identify a given sample (e.g. XXYYZZ).

Mapping reads against contigs

To map the fastq read files against the contig reference database for each sample, run the folliwing. This will use bwa mem to map the raw reads to the “reference” contigs:

.. code-block:: bash

phyluce_snp_bwa_multiple_align \
    --config phasing.conf \
    --output multialign-bams \
    --cores 12 \
    --log-path log \

This will produce an output along these lines

.. code-block:: text

2016-03-09 16:40:22,628 - phyluce_snp_bwa_multiple_align - INFO - ============ Starting phyluce_snp_bwa_multiple_align ============
2016-03-09 16:40:22,628 - phyluce_snp_bwa_multiple_align - INFO - Version: 1.5.0
2016-03-09 16:40:22,629 - phyluce_snp_bwa_multiple_align - INFO - Argument --config: /path/to/phasing.conf
2016-03-09 16:40:22,629 - phyluce_snp_bwa_multiple_align - INFO - Argument --cores: 1
2016-03-09 16:40:22,629 - phyluce_snp_bwa_multiple_align - INFO - Argument --log_path: None
2016-03-09 16:40:22,629 - phyluce_snp_bwa_multiple_align - INFO - Argument --mem: False
2016-03-09 16:40:22,629 - phyluce_snp_bwa_multiple_align - INFO - Argument --no_remove_duplicates: False
2016-03-09 16:40:22,629 - phyluce_snp_bwa_multiple_align - INFO - Argument --output: /path/to/mapping_results
2016-03-09 16:40:22,629 - phyluce_snp_bwa_multiple_align - INFO - Argument --subfolder: split-adapter-quality-trimmed
2016-03-09 16:40:22,629 - phyluce_snp_bwa_multiple_align - INFO - Argument --verbosity: INFO
2016-03-09 16:40:22,630 - phyluce_snp_bwa_multiple_align - INFO - ============ Starting phyluce_snp_bwa_multiple_align ============
2016-03-09 16:40:22,631 - phyluce_snp_bwa_multiple_align - INFO - Getting input filenames and creating output directories
2016-03-09 16:40:22,633 - phyluce_snp_bwa_multiple_align - INFO - ---------------------- Processing genus_species1 ----------------------
2016-03-09 16:40:22,633 - phyluce_snp_bwa_multiple_align - INFO - Finding fastq/fasta files
2016-03-09 16:40:22,636 - phyluce_snp_bwa_multiple_align - INFO - File type is fastq
2016-03-09 16:40:22,637 - phyluce_snp_bwa_multiple_align - INFO - Creating read index file for genus_species1-READ1.fastq.gz
2016-03-09 16:40:33,999 - phyluce_snp_bwa_multiple_align - INFO - Creating read index file for genus_species1-READ2.fastq.gz
2016-03-09 16:40:45,142 - phyluce_snp_bwa_multiple_align - INFO - Building BAM for genus_species1
2016-03-09 16:41:33,195 - phyluce_snp_bwa_multiple_align - INFO - Cleaning BAM for genus_species1
2016-03-09 16:42:03,410 - phyluce_snp_bwa_multiple_align - INFO - Adding RG header to BAM for genus_species1
2016-03-09 16:42:49,518 - phyluce_snp_bwa_multiple_align - INFO - Marking read duplicates from BAM for genus_species1
2016-03-09 16:43:26,917 - phyluce_snp_bwa_multiple_align - INFO - Creating read index file for genus_species1-READ-singleton.fastq.gz
2016-03-09 16:43:27,066 - phyluce_snp_bwa_multiple_align - INFO - Building BAM for genus_species1
2016-03-09 16:43:27,293 - phyluce_snp_bwa_multiple_align - INFO - Cleaning BAM for genus_species1
2016-03-09 16:43:27,748 - phyluce_snp_bwa_multiple_align - INFO - Adding RG header to BAM for genus_species1
2016-03-09 16:43:28,390 - phyluce_snp_bwa_multiple_align - INFO - Marking read duplicates from BAM for genus_species1
2016-03-09 16:43:30,633 - phyluce_snp_bwa_multiple_align - INFO - Merging BAMs for genus_species1
2016-03-09 16:44:05,811 - phyluce_snp_bwa_multiple_align - INFO - Indexing BAM for genus_species1
2016-03-09 16:44:08,047 - phyluce_snp_bwa_multiple_align - INFO - ---------------------- Processing genus_species2 ----------------------

Phasing mapped reads

In the previous step you aligned your sequence reads against the reference FASTA file for each sample. The results are stored in the output folder in bam format. Now you can start the actual phasing of the reads. This will analyze and sort the reads within each bam file into two separate bam files (genus_species1.0.bam and genus_species1.1.bam).

The program is very easy to run and just requires the path to the bam files (output folder from previous mapping program, /path/to/mapping_results) and the path to the configuration file, which is the same file as used in the previous step (/path/to/phasing.conf). Then, run:

.. code-block:: bash

phyluce_snp_phase_uces \
    --config phasing.conf \
    --bams multialign-bams \
    --output multialign-bams-phased-reads

The output will look something like the following

.. code-block:: text

2018-07-27 09:58:35,963 - phyluce_snp_phase_uces - INFO - ================ Starting phyluce_snp_phase_uces ================
2018-07-27 09:58:35,963 - phyluce_snp_phase_uces - INFO - Version: git 0babc1a
2018-07-27 09:58:35,963 - phyluce_snp_phase_uces - INFO - Argument --bams: /Users/bcf/tmp/phyluce/multialign-bams
2018-07-27 09:58:35,963 - phyluce_snp_phase_uces - INFO - Argument --config: /Users/bcf/tmp/phyluce/phasing.conf
2018-07-27 09:58:35,963 - phyluce_snp_phase_uces - INFO - Argument --conservative: False
2018-07-27 09:58:35,963 - phyluce_snp_phase_uces - INFO - Argument --cores: 1
2018-07-27 09:58:35,963 - phyluce_snp_phase_uces - INFO - Argument --log_path: None
2018-07-27 09:58:35,964 - phyluce_snp_phase_uces - INFO - Argument --output: /Users/bcf/tmp/phyluce/multialign-bams-phased-reads
2018-07-27 09:58:35,964 - phyluce_snp_phase_uces - INFO - Argument --verbosity: INFO
2018-07-27 09:58:35,964 - phyluce_snp_phase_uces - INFO - ================ Starting phyluce_snp_phase_uces ================
2018-07-27 09:58:35,964 - phyluce_snp_phase_uces - INFO - Getting input filenames and creating output directories
2018-07-27 10:02:10,526 - phyluce_snp_phase_uces - INFO - ================ Starting phyluce_snp_phase_uces ================
2018-07-27 10:02:10,527 - phyluce_snp_phase_uces - INFO - Version: git 0babc1a
2018-07-27 10:02:10,527 - phyluce_snp_phase_uces - INFO - Argument --bams: /Users/bcf/tmp/phyluce/multialign-bams
2018-07-27 10:02:10,527 - phyluce_snp_phase_uces - INFO - Argument --config: /Users/bcf/tmp/phyluce/phasing.conf
2018-07-27 10:02:10,527 - phyluce_snp_phase_uces - INFO - Argument --conservative: False
2018-07-27 10:02:10,527 - phyluce_snp_phase_uces - INFO - Argument --cores: 1
2018-07-27 10:02:10,527 - phyluce_snp_phase_uces - INFO - Argument --log_path: None
2018-07-27 10:02:10,527 - phyluce_snp_phase_uces - INFO - Argument --output: /Users/bcf/tmp/phyluce/multialign-bams-phased-reads
2018-07-27 10:02:10,527 - phyluce_snp_phase_uces - INFO - Argument --verbosity: INFO
2018-07-27 10:02:10,527 - phyluce_snp_phase_uces - INFO - ================ Starting phyluce_snp_phase_uces ================
2018-07-27 10:02:10,528 - phyluce_snp_phase_uces - INFO - Getting input filenames and creating output directories
2018-07-27 10:02:10,528 - phyluce_snp_phase_uces - INFO - ------------- Processing alligator_mississippiensis -------------
2018-07-27 10:02:10,528 - phyluce_snp_phase_uces - INFO - Phasing BAM file for alligator_mississippiensis
2018-07-27 10:02:21,695 - phyluce_snp_phase_uces - INFO - Sorting BAM for alligator_mississippiensis
2018-07-27 10:02:23,115 - phyluce_snp_phase_uces - INFO - Sorting BAM for alligator_mississippiensis
2018-07-27 10:02:24,533 - phyluce_snp_phase_uces - INFO - Creating REF/ALT allele FASTQ file 0
2018-07-27 10:02:24,583 - phyluce_snp_phase_uces - INFO - Creating REF/ALT allele FASTQ file 1
2018-07-27 10:02:24,613 - phyluce_snp_phase_uces - INFO - Creating REF/ALT allele FASTQ file unphased
2018-07-27 10:02:24,643 - phyluce_snp_phase_uces - INFO - Creating REF/ALT allele FASTA file 0 from FASTQ 0
2018-07-27 10:02:24,654 - phyluce_snp_phase_uces - INFO - Creating REF/ALT allele FASTA file 1 from FASTQ 1
2018-07-27 10:02:24,662 - phyluce_snp_phase_uces - INFO - Creating REF/ALT allele FASTA file unphased from FASTQ unphased
2018-07-27 10:02:24,673 - phyluce_snp_phase_uces - INFO - Checking for correct FASTA files
2018-07-27 10:02:24,673 - phyluce_snp_phase_uces - INFO - Cleaning FASTA files
2018-07-27 10:02:24,681 - phyluce_snp_phase_uces - INFO - Balancing FASTA files
2018-07-27 10:02:24,682 - phyluce_snp_phase_uces - INFO - Symlinking FASTA files

The program automatically produces a consensus sequence for each of these phased bam files (= allele sequence) and stores these allele sequences of all samples in a joined FASTA file (joined_allele_sequences_all_samples.fasta). This allele FASTA is deposited in the subfolder fastas within your output folder (e.g. /path/to/multialign-bams-phased-reads/fastas).

You can directly input that file (joined_allele_sequences_all_samples.fasta) back into the alignment pipeline, like so:

.. code-block:: bash

phyluce_align_seqcap_align \
    --fasta /path/to/multialign-bams-phased-reads/fastas/joined_allele_sequences_all_samples.fasta \
    --output PHASED-DATA mafft-nexus-edge-trimmed \
    --taxa 4 \
    --aligner mafft \
    --cores 12 \
    --incomplete-matrix \
    --log-path log

Following alignment, you can choose how you’d like to treat these data (e.g. internally trim, analyze, etc).

Locus filtering

Locus filtering is the final step before phylogenetic analysis can happen. Filtering out uninformative or largely incomplete loci can improve performance and efficiency. In our pipeline there are generally two types of locus filtering we’ll be performing:

After filtering for completeness, we need to “clean up” the locus files. This can be done with the following command:

phyluce_align_remove_locus_name_from_nexus_lines \
    --alignments muscle-nexus-75p \
    --output muscle-nexus-clean-75p \
    --cores 19

This adds a new folder called muscle-nexus-clean-75p that contains the same alignments as muscle-nexus-75p, but the locus names have been removed from the taxon names in the file. We can go ahead and remove the “uncleaned” version of the folder:

rm -r muscle-nexus-75p

rm is the “trash” command of Bash. The -r flag specifies that you want to trash “recursively”, meaning that you can trash a directory as well as all of the files and directories inside of it.

Filtering by parsimony-informative sites

Filtering by parsimony-informative sites (PIS) generally means you are filtering out loci that have below a certain number of PIS. Loci with low information content can bias your results so it’s good to filter them out. To calculate how many PIS are in each locus, and perform various types of filtering, we will use an R script written by Brown lab alumnus Connor French that makes use of the package Phyloch. If you haven’t already, download pars_inform.R from the example-files directory of this repository.

To use R in the terminal, you have to have it installed on your system. Then just type in R. Until you leave R, the terminal will now assume all of your commands are written in R, instead of Bash.

Generally my procedure here is to simply open pars_inform.R in a text editor and copy-and-paste commands in chunks into the terminal. You will need to alter parts of the file to suit your needs. The first part of the file looks like this:

#script to count parsimony informative sites and output to csv
#load phyloch

#setwd where the reads are

setwd is the command that sets your “working directory,” which is where R assumes any files specified (and written) will be located. Here I’ve set the working directory to 5_taxon-sets/all. You may need to alter it depending on your directory structure.

Next run the following chunk of code, which does most of the dirty work in calculating the PIS in each locus and other associated information. It will take a minute to finish.

#get list of file names and number of files
listoffiles <- list.files(pattern="*.nex*")
nooffiles <- length(listoffiles)

#these are the column names
record <- c("locusname","pis","length")

#loop to calculate PIS and write the PIS info to a text file
for (j in 1:nooffiles) {
  tempfile <- read.nex("list_of_pis_by_locus.txt")
  templength <- dim(tempfile)[2]
  temppis <- pis(tempfile)
  temp <- cbind(listoffiles[j],temppis,templength)
  record <- rbind(record,temp)
#add column names to text file
write.table(record, "list_of_pis_by_locus.txt",quote=FALSE, row.names=FALSE,col.names=FALSE)

This command will output a file named list_of_pis_by_locus.txt in your muscle-nexus-clean-75p directory that contains a list of every locus, its number of PIS, and the locus’ length.

Now you need to decide what kind of filtering to perform. I have performed two types of filtering for various projects: The first is the most common, when you went to retain only loci that have PIS within a certain range of values. First, run this chunk:

###calculate summary data and visualize PIS variables###
par_data <-"list_of_pis_by_locus.txt", header = TRUE))
plot(par_data$pis) #visualize number of PIS

This should print a plot to the screen of each locus (x) and its number of associated PIS (y). In my case, the number of PIS is strongly clustered below 5 for most loci, with many of them having 0 PIS. Here is my graphical output:
PIS graph
You can use this graph to inform what you want your thresholds for retaining PIS to be. When doing this, I generally remove low-informative sites as well as highly-informative outliers. Given this distribution, I’d like to remove loci with fewer than 3 PIS, and more than 15 (3 < PIS < 15). Use the following code block:

###calculate summary data and visualize PIS variables###
par_data <-"list_of_pis_by_locus.txt", header = TRUE))
plot(par_data$pis) #visualize number of PIS
inform <- subset(par_data, pis >= 3) #subset data set for loci with PIS > 3
inform <- subset(inform, pis < 15) #remove outliers (change based on obvious outliers to data set)
plot(inform$pis) #visualize informative loci
length(inform$pis) #calculate the number of informative loci
sum(inform$pis) #total number of PIS for informative loci
fivenum(inform$pis) #summary stats for informative loci (min, lower quartile, median, upper quartile, max)
inform_names <- inform$locusname #get locus names of informative loci for locus filtering
#write locus names to a file
write.table(inform_names, file = "inform_names_PIS_15.txt",quote=FALSE, row.names=FALSE,col.names=FALSE)

This writes a file named inform_names_PIS_15.txt to muscle-nexus-clean-75p. The file contains the names of all UCE loci that meet the criteria you set in the above code block (3 < PIS < 15).

In one of my projects, I desired instead to find the 200 most parsimony-informative loci rather than filter for an unknown number that fit specific criteria. This involves ranking the loci by number of PIS and then taking the top 200 (or whatever number you want). To do this, run this code block:

#find 200 most parsimony-informative loci and save their names
pistab <- read.delim("list_of_pis_by_locus.txt", header=TRUE, sep = " ")
pistab_sorted <- pistab[order(-pistab$pis),]
pistab_sorted_truncated <- pistab_sorted[1:200,]
topnames <- pistab_sorted_truncated$locusname
write.table(topnames, file = "top200names.txt",quote=FALSE, row.names=FALSE,col.names=FALSE)

This writes a file named top200names.txt to muscle-nexus-clean-75p that is similar in structure and principle to inform_names_PIS_15.txt.

JLB pro tip To quit R, type: quit("yes")

JLB UPDATE 10/2020 The follow section is likely not needed (I adjusted the R code), just check to make sure your output matches the format in the second block.

JLB UPDATE 10/2020: Ignore START ************

Now we have a list of the loci we desire (whether loci that fit a certain range of PIS or the most informative X loci). We now need to use this list to grab the .nexus alignment files for those loci. First we need to check to see that the list needs to be modified. Does the list file should look something like this? If so follow the following bit. In the end we want it to look like the next example (just names - no quotes or row numbers). If yours currently looks like this, then skip this section.

"1" ""
"2" ""
"3" ""
"4" ""
"5" ""

We need to change it so that all that remains are .nexus filenames. We can use regular expressions to do this. If you don’t know, regular expressions are code constructs that are commonly used to search and replace specific patterns in text. They are very intimidating to look at at first, but learning them can make you a badass. You can use Bash commands like sed and awk to use regular expressions in the Terminal, but we can also do it using the find-and-replace (ctrl+H) utility in gedit. Other text editor programs like Notepad++ also support the use of regular expressions.

First, go ahead and manually remove the first line "x". Then, open find-and-replace and make sure the “Regular expression” box is ticked. In the “Find” box, type in "\d+" "(.*.nexus)". This is the “search” portion of the regular expression. Then, in the “Replace” box, simply type \1. This is the “replace” portion. If you click “Find”, the program should highlight everything in the file. Click “Replace All”, and you should be left with only the file names, no quotes. Doesn’t that feel good?

JLB UPDATE 10/2020: Ignore END

The file should now look like this:

Regular expression explained: We build the regular expression sequentially to match each line. We need to find the commonalities between each line and represent them using different constructs. First we use a " quote to match the starting quote of each line. Then, we use \d+ to match “one or more digits”, which are in the first column. We add " " to match the quote, space, quote, that follows the first number. Next we use .* as a “wildcard” (analogous to the * of Bash), which matches “anything except a line break zero or more times”. This construct matches the “uce-4885” part of the first line. Then we close with .nexus", which matches the last portion of the line. Note that the construct .*.nexus is enclosed in (parentheses). We do this to “capture” whatever is inside of the parentheses, which we can then use to Replace our match. The Replace construct \1 signifies the first “captured” match, which is that bit in the parentheses. So basically, we are searching for the whole line, capturing the part we want (the filename), and then replacing the whole line with the capture.

Now that that’s done, we have to take our cleaned list of loci and grab the corresponding .nexus alignment files. First, quit R with q(). Then, enter the following Bash commands (Make sure you are working from the directory ‘tutorial/5_taxon-sets/all/’):

mkdir muscle-nexus-clean-75p_3
cd muscle-nexus-clean-75p
for n in $(cat inform_names_PIS_15.txt); \
   do cp "$n" ../muscle-nexus-clean-75p_3; \
cd ..

We first create a new directory muscle-nexus-clean-75p_3 to place our filtered loci in (the _3 part signifies that we filtered loci with fewer than 3 PIS). Then, we go into the muscle-nexus-clean-75p directory. The final section is a for loop that, for every line in inform_names_PIS_15.txt (i.e., for every locus with 3 < PIS < 15), copy that locus’ corresponding .nexus file from muscle-nexus-clean-75p to muscle-nexus-clean-75p_3. Finally, we go back to the all directory to prepare for the next step.

After running the command, muscle-nexus-clean-75p_3 contains 385 loci. That’s quite a lot of filtering. You may wish to filter less stringently in order to retain more loci. I’ll be working on the loci in this folder for the remainder of the tutorial.

Concatenating alignments

The final step before phylogenetic analysis is to concatenate your set of filtered loci into a single alignment. Concatenation in this context means that your alignments will be combined into a single file, and lined up one after another. Note that concatenation is a problematic assumption in phylogenetics and is mostly used for convenience; this is because you’re assuming that all of your loci are evolving as a “supergene”, when we know in fact this is not the case. We can use “partitions” to divide the concatenated matrix into portions corresponding to each locus, and then assign a different substitution model to each; while more technically “correct,” this does have the side effect of making your phylogenetic analysis take way longer. In this tutorial we will not be using partitions, although I will show you how to generate them automatically if you want to use them.

To concatenate the loci in muscle-nexus-clean-75p_3, use this command (Make sure you are working from the directory ‘tutorial/5_taxon-sets/all/’):

phyluce_align_format_nexus_files_for_raxml \
    --alignments muscle-nexus-clean-75p_3 \
    --output muscle-nexus-iqtree-75p_3 \

The output will be a new folder named muscle-nexus-iqtree-75p_3 that contains two files: muscle-nexus-clean-75p_3.phylip, a PHYLIP-formatted alignment file containing all 385 retained loci, and muscle-nexus-clean-75p_3.charsets, a list of partitions that was generated because we specified the --charsets option. This list specifies the location of each locus in the alignment and can be used to specify partitions for phylogenetic analysis, although we will not be using it.

Phylogenetic analysis

Now we’re finally at the good part. This isn’t intended to be a comprehensive guide for constructing phylogenies but I will show you how to perform some basic analyses using various methods. The previous steps outlined in this guide are basically a pipeline for preparing your data for this step, which is the actual data analysis part of your project.

Now that we’re doing using PHYLUCE, you should switch back to Java 8 using sudo update-alternatives --config java.

Maximum likelihood analysis with RAxML

I think of maximum likelihood (ML) methods as like the peanut-butter-and-jelly sandwich of phylogenetics. A PB&J isn’t the most delicious, rigorously constructed sandwich (like a Bayesian Lettuce and Tomato), but it’s certainly better than your grandpa’s Parsimonious Mayonnaise-only sandwich. A PB&J is serviceable, quick, and always there when you need it. There are several ML methods commonly in use for phylogenetics, but by far the most widely-used is RAxML.
First, make a new RAxML directory, copy your PHYLIP file into it, and go into it yourself:

mkdir muscle-nexus-raxml-75p_3 
cp muscle-nexus-iqtree-75p_3/muscle-nexus-clean-75p_3.phylip muscle-nexus-raxml-75p_3/
cd muscle-nexus-raxml-75p_3 

Here’s the RAxML command I used:

~/Documents/RAxML/raxmlHPC-PTHREADS-AVX \
	-f a \
	-x $RANDOM \
	-p $RANDOM \
	-# autoMRE \
	-s muscle-nexus-clean-75p_3.phylip \
	-n muscle-nexus-raxml_75p_3 \
	-T 19

To see more of RAxML’s options, you can type ~/Documents/RAxML/raxmlHPC-PTHREADS-AVX -h | less -S. Note that when I call RAxML (~/Documents/RAxML/raxmlHPC-PTHREADS-AVX) I am specifyng the path to its location, which is in my Documents folder. Yours may differ.

Running only 6 taxa, 385 loci, and over 19 cores, the analysis finished in a matter of seconds. You can view the tree by installing FigTree. To open it on a Linux machine, simply type figtree into the Terminal (note that you have to be running Java 8, not Java 7). Open the file RAxML_bipartitions.muscle-nexus-raxml_75p_3. Figtree contains numerous options for manipulating images of phylogenetic trees. You can reroot, rotate branches, adjust branch widths, create a circle tree, and more. Here is a modified image of my RAxML tree:

raxml tree

This is more or less in line with what we know of the Ameerega phylogeny. The node labels are boostrap values, a measure of support for that node. You can add them by selecting “label” as the displayed value for node labels in FigTree. JLB Update viewing trees - 10/2020 Viewing trees via FigTree is perfectly fine, however I hate changing Java back and forth. Thus, I prefer to use the website TreeView. Select ‘Tree in Newick Format’ and select the file “RAxML_bestTree.muscle-nexus-raxml_75p_3”.

JLB Update on running RAxML and installing RAxML/IQ-TREE - 10/2020 I strongly suggest you install raxml a more traditional way. Here’s the RAxML command I used:

sudo apt update
sudo apt install raxml
sudo apt install iqtree

Then to run your analysis type this:

	-f a \
	-x $RANDOM \
	-p $RANDOM \
	-# autoMRE \
	-s muscle-nexus-clean-75p_3.phylip \
	-n muscle-nexus-raxml_75p_3 \
	-T 7

Maximum likelihood analysis with IQ-TREE

RAxML is very popular, but I actually prefer to use IQ-TREE, an alternative ML phylogenetics program. I like IQ-TREE because it has great documentation (and a flashy website), flexible and easy-to-understand options, and most importantly it integrates model selection with a wide array of substitution models, certainly more than are offered by RAxML. I generally use IQ-TREE for all of my ML analyses.

We already made a directory for IQ-TREE when we concatenated our alignments. Go ahead and go into it with cd muscle-nexus-iqtree-75p_3. Then run the following command:

iqtree \
-s muscle-nexus-clean-75p_3.phylip \
-bb 10000 \
-m GTR \
-nt 19

JLB Note 10/2020 The above syntax had some issues for me, if you have an error type manually typing this syntax or paste this: iqtree -s muscle-nexus-clean-75p_3.phylip -bb 10000 -m GTR -nt 7

There are many many more options and things you can do with IQ-TREE - this is about as basic of an analysis as you can do with it. The output tree is stored in the file muscle-nexus-clean-75p_3.phylip.treefile, which I show below:

iqtree tree
As you can see, it is identical in topology to my RAxML tree, but with slightly higher overall bootstrap values and slightly different branch lengths. When comparing the two programs, I find that this is a consistent pattern.

Coalescent analysis with ASTRAL

One thing I had to learn with phylogenetics is the difference between a gene tree and a species tree. When you run a phylogenetic analysis over a single gene like cox1 (or a concatenated matrix), you’re really reconstructing the evoutionary history of that gene, not the species involved. The tree can be misleading because generally the labels make us think that the tips represent the whole species, but really it’s just a gene of that species. In the 000’s, a suite of techniques were developed for integrating coalescent theory, one of the foundations of population genetics, with phylogenetic methods. Basically, coalescent methods track the individual histories of each gene and then use those to construct a species tree, which is a wholesale representation of the separate genealogical histories contained within that organism’s genome. This accounts for “incomplete lineage sorting”, which is the phenomenon that occurs when two or more genes within the same organism have different evolutionary histories, leading to “gene tree discordance”.

There are many different coalescent methods, and many of them work in slightly different ways. The one I like to use the most is called ASTRAL, which is a summary method. Basically, you feed it a bunch of individual gene trees (constructed using your method of choice), and summarizes them and spits out a species tree. You also need to assign each sample in your set to a putative species, and ASTRAL will “coalesce” them into a single tip. Thus, there is a bit of prior knowledge necessary to use this technique.

Constructing gene trees with IQ-TREE for ASTRAL input

The first step of an ASTRAL analysis is to construct the gene trees. We will do this for the PIS-filtered loci using IQ-TREE embedded in a for loop.

cd ..
mkdir muscle-nexus-iqtree-genetrees-75p_3
mkdir muscle-nexus-astral_3
cd muscle-nexus-iqtree-genetrees-75p_3/
cp ../muscle-nexus-clean-75p_3/* .

First we go back into all, then we make a new directory to hold the gene trees, as well as an ASTRAL folder, which we will put relevant files in later. We go into the gene trees folder, then we copy all of the PIS-filtered loci into this directory (not very efficient, I know). Here’s the command to run the gene trees.

for i in *.nexus; do ~/Desktop/Bioinformatics/iqtree-1.6.5-Linux/bin/iqtree -s $i -bb 1000 -m GTR -nt AUTO -czb -redo; done

JLB update 10/2020 For this code to work for me I had to remove the software version/location:

for i in *.nexus; do iqtree -s $i -bb 1000 -m GTR -nt AUTO -czb -redo; done

Forgive my crude presentation of this command, but it is tried and true. Basically, we are simply looping over every .nexus file in this folder and running it through IQ-TREE, with 1,000 ultrafast bootstrap replicates and a GTR substitution model. The -nt AUTO option specifies that the program will automatically decide how many threads to use, which was apparently necessary for this run. The -czb option contracts very short branch lengths (common in gene trees) to polytomies, which reduces gene tree bias. The -redo option allows you to repeat the command without manually deleting previous results, which is handy when you’re troubleshooting. Unfortunately, since the -nt AUTO option was necessary, many of the trees were run with 1 core only, so the analysis took a decent amount of time - about seven hours. With more cores, this would have been shorter so you might try experimenting with small numbers rather than going for the AUTO option. When running a full gene tree analysis with ~1000 loci and >30 taxa, this process usually takes a day or so even with the full compliment of cores.

Note that in the above command, I call the full filepath to an IQ-TREE v 1.6.5 executable, rather than just calling iqtree like I did in the previous section. On my machine, calling iqtree only calls v 1.5.5, which does not have the -czb option.

When the command is done running, the folder muscle-nexus-iqtree-genetrees-75p_3 should contain a multitude of files, including 385 .treefile files that contain a gene tree for each locus. The remaining step is to concatenate all of our treefiles into a single file with cat:

cat *.treefile > ../muscle-nexus-astral_3/gene_trees.newick

This file is stored in the muscle-nexus-astral_3 folder in all. It should contain a list of 385 trees. This is the main input for ASTRAL.

Creating a mapping file for ASTRAL

The next step is to create a mapping file that categorizes each of your samples by species, which allows ASTRAL to coalesce multiple samples of the same species into a single tip. Note that if your sample already has only one sample per species, you do not need a mapping file, and ASTRAL will correctly assume that that is the case.

It would be more work than it’s worth to create the mapping file entirely in Terminal, so we’ll mostly do it by hand. It’s fairly straightforward. First, let’s go back into the all folder with cd ... Then, use this command to copy taxa.txt (already a list of our samples) and write it to a file named sp_map.txt in all:

cp ../../taxa.txt muscle-nexus-astral_3/sp_map.txt

We now need to edit sp_map.txt manually to assign each sample to a species. First we need to remove the [all] header at the top; we don’t need that for this. The basic structure of the mapping file is simply the species, followed by a colon, then the different samples that comprise that species, separated from each other by commas. The only case where we have more than one sample per species is for our Ameerega bassleri samples, so we’ll put them together. Also, the IQ-TREE runs changed our samples’ names to having _ underscores rather than - dashes between name components, so we need to manually change the dashes to underscores in the mapping file (use a simple find-and-replace). Your file should look like this after editing:


Running ASTRAL

Ironically, running the gene trees for ASTRAL is the longest part of the process, whereas ASTRAL itself will finish running in a minute or two. First, go into the ASTRAL directory with cd muscle-nexus-astral_3. Then, run this command from that directory, making sure that gene_trees.newick and sp_map.txt are both in there as well:

java -jar ~/Documents/Astral/astral.5.6.1.jar \
	-i gene_trees.newick \
	-o astral_sptree.treefile \
	-a sp_map.txt

The output file is called astral_sptree.treefile. Here is what mine looks like after some editing in FigTree:

astral tree
Again, our topology is identical to the previous trees, with the exception that our two A. bassleri samples have been “coalesced” into a single tip. Also note that the tip labels have been changed to reflect the species assignments in the mapping file. Finally, note that support measures are in local posterior probabilities rather than bootstrap values.

Bayesian analysis with BEAST

Bayesian phylogenetic methods are commonly regarded as “the best” and “most robust” in the business, but also infamously elicit moans and groans (at least from me) because of the considerable computational power and time required to run them. BEAST 2 is one of the most popular of these programs, and I will be using it to demonstrate a Bayesian analysis. A detailed tutorial explaining the ins and outs of priors, convergence, ESS values, etc. is beyond the scope of this guide so I will assume you’ve done at least a little reading up on Bayesian methods.

Subsetting loci for BEAST

There are three things that can make a BEAST run take an intractably long amount of time to converge: having lots of sequence data (=loci), lots of taxa (=samples), or too few threads (=cores). We know we’ll be using 19 cores, and we already have a very small number of taxa (n=6), but we still have a decent number of loci (n=385). We’re going to further subset our collection of loci to increase computational efficiency. Doing this does suck because you’re effectively ignoring a large portion of your data, but as one reviewer once told me, “Nobody has seventy years to wait for their BEAST run to converge”.

There are a couple ways to subset your loci. One is to use the X most parsimony-informative loci, which I showed you how to find in this section. Another is to use a subset of random loci. We’ll do this for demonstrative purposes. To get a list of, let’s say 50 random loci, use this command (make sure you’re in the all directory):

shuf -n50 muscle-nexus-clean-75p/inform_names_PIS_3.txt > random_loci_n50.txt

This creates a file random_loci_n50.txt in all that contains a list of 50 .nexus files, randomly selected from your PIS-filtered set of 385 loci. In the past, I have obtained a list of 200 random loci and then split it further into 4 random subsets of 50 loci each with the command split -n 4 random_loci_n200.txt. We will not be doing this as 50 loci is few enough as is.

Once we have our list, we can copy the loci listed in random_loci_n50.txt into a new folder, muscle-nexus-beast_3_n50:

mkdir muscle-nexus-beast_3_n50
for i in $(cat random_loci_n50.txt); \
	do cp muscle-nexus-clean-75p_3/$i muscle-nexus-beast_3_n50/; \

The for loop identifies each locus listed in random_loci_n50.txt, and then copies those loci from muscle-nexus-clean-75p_3 to muscle-nexus-beast_3_n50.

Finally, we need to concatenate our loci into a single alignment, using this command:

phyluce_align_format_nexus_files_for_raxml \
	--alignments muscle-nexus-beast_3_n50 \
	--output muscle-nexus-beast_3_n50/beast_n50 \

This created a new subfolder, beast_n50, inside the folder muscle-nexus-beast_3_n50, and put a .nexus file,, into it. Go ahead and cd into beast_n50, as we’ll be working from here for the rest of this tutorial. Note that we used the --nexus option to specify the output to be in .nexus format rather than .phylip.

Setting up a BEAST run with BEAUti

BEAST is a bit different than the other phylogenetics programs we’ve used because it is only based partially on the command line. Bayesian phylogenetics programs tend to have more settings (due to prior selection) than others, so the authors of BEAST decided to make a GUI program called BEAUti to make setting these priors more intuitive. If you have BEAUti installed, go ahead and open it simply by typing beauti (make sure Java 8 is on).

BEAUti will open up in a new window. Go ahead and select “File” and open up via the “Import alignment” option. You should see a list of six tabs near the top of the screen. We have to go through and provide settings for each one:

When you’re done with all of those settings, save the resulting .xml file as beast_n50.xml. It should be stored in your beast_n50 folder.

Running BEAST

Starting the actual BEAST run is very straightforward. Make sure you’re in the beast_n50 directory, and type:

beast -threads 19 beast_n50.xml

There are so few options because the beast_n50.xml contains all of our settings as well as our sequence data inside of it. Using 19 cores, the run took my computer 50 minutes to finish - your mileage may vary. Note that for a real, publishable BEAST analysis, you will be using a much larger dataset (most likely) and the analysis will take much, much longer. Also, it is best practice to do each run twice to assess whether they converge to the same values - effectively doubling the amount of time required. In the past I ran BEAST for 4 subsets of 50 loci each (200 loci total), for 36 taxa. To run each subset twice (amounting to 8 runs) took about 2 weeks total.

Your output will be stored as muscle-nexus-beast_3_n50.trees, which is a collection of trees representing the posterior distribution obtained by BEAST.

Processing BEAST output

It’s a good idea to use Tracer to assess whether your BEAST run converged before you do anything else. Open Tracer with the following command:

java -jar ~/Desktop/Bioinformatics/Tracer_v1.7.1/lib/tracer.jar

This will open a new window. Click the + button to load the Trace file, in this case muscle-nexus-beast_3_n50.log. The left-hand part of the window will show mean values for parameter estimates, as well as ESS values for each one. A common rule of thumb is that all ESS values should be > 200. For this run, we have some as low as 47, which means we probably should have added more generations to our MCMC chain (usually I use 100,000,000 rather than 10,000,000).

To get our BEAST output into a format where we can visualize it as a single tree, we need to pipe it through a couple of other programs that should have been installed along with BEAST and BEAUti. If you ran more than one BEAST run and you’d like to create a single tree from your combined runs, use the program LogCombiner to combine tree and log files into one. Just type in logcombiner to open the program. In this tutorial, we only did one BEAST run, so running LogCombiner is unnecessary.

We now use TreeAnnotator to visualize our posterior “tree-cloud” as a single tree. Open TreeAnnotator by typing treeannotator.

After you click “Run,” the program will run for a few moments and spit out beast_n50.treefile. Since this was a pretty flawed BEAST run, our output isn’t anything visualizable as it has very very short branch lengths - but this exercise took you through the motions of doing a Bayesian divergence time analysis in BEAST.

JLB Install notes to be used alongside the Phyluce install guide

If you do a new install, please install java 1.7 and switch to it before installing conda and phyluce, seriously.
To install Java 7 type “sudo apt-get install openjdk-7-jre”. If this doesn’t work see To switch (on a Linux machine), use the following command: sudo update-alternatives --config java. The terminal will prompt you to enter the computer’s password. Do so, and then select which version of Java to use. In this case, you will want to switch to Java 7 (for me, the choice looks like /usr/lib/jvm/jdk1.7.0_80/bin/java).

-Quick place to find miniconda To install then ‘cd to download folder’ and ‘bash’

-If python doesnt show ‘anaconda…’ type ‘export PATH=$HOME/miniconda2/bin:$PATH’ -To add channels: conda config –add channels defaults conda config –add channels bioconda conda config –add channels conda-forge

Where to get older [GATK] - minght not be needed (

To install java from conda (not actually needed) ‘conda install -c anaconda java-1.7.0-openjdk-cos6-x86_64’