Wednesday 18 March 2015

Installing ANNOVAR in Galaxy

ANNOVAR (http://www.openbioinformatics.org/annovar/) is a popular tool used to functionally annotate genetic variants detected from various genomes. The Galaxy toolshed includes a tool called table_annovar which can be used to run ANNOVAR. Installation of the tool into a local Galaxy instance is not fully automated and requires some manual steps which are sketched in the tool's README; this post expands on those basic instructions to hopefully make the process easier for others.

Note that these instructions are for the '2014-02-12' revision of the table_annovar tool (changeset 6:091154194ce8), installing into the latest_2014.08.11 version of galaxy-dist.

1. Install the table_annovar tool from the toolshed

This is the devteam owned tool on the main toolshed:

https://toolshed.g2.bx.psu.edu/view/devteam/table_annovar

and can be installed via the usual admin interface within Galaxy (see for example https://wiki.galaxyproject.org/Admin/Tools/AddToolFromToolShedTutorial).

2. Install the ANNOVAR software

In addition to the Galaxy tool you also the actual ANNOVAR software. To download a copy you first need to register at:

http://www.openbioinformatics.org/annovar/annovar_download_form.php

Once registered you should receive a link to download the latest version (e.g. annovar-2014nov12.tar.gz). Note that ANNOVAR's licensing conditions prohibit commercial use without a specific agreement, and that users are not permitted to redistribute ANNOVAR to others, including lab members.

Unpack the tar.gz file into a directory where it can be executed by your Galaxy user. For example:

# Make a location for ANNOVAR
$ mkdir -p /home/galaxy/apps/annovar/
# Move into this directory
$ cd /home/galaxy/apps/annovar/
# Unpack the ANNOVAR software
$ tar zxf /path/to/annovar-2014nov12.tar.gz
# Rename the unpacked directory to '2014nov12'
$ mv annovar 2014nov12

This puts the ANNOVAR programs into the directory /home/galaxy/apps/annovar/2014nov12. The actual location isn't so important as long as you know where it is so you can reference it in the next section.

3. Set up the Galaxy environment to make ANNOVAR available to the tool

Essentially we need to manually create the files and directories that Galaxy will use to set the environment appropriately when the ANNOVAR tool is run.

This needs to be done in the directory pointed to by the tool_dependency_dir variable in your Galaxy configuration file (either galaxy.ini or universe_wsgi.ini, depending on the age of your Galaxy distribution) - by default this is ../tool_dependencies (which is relative to your galaxy-dist directory).

Under this directory make a subdirectory for ANNOVAR, for example:

$ cd tool_dependencies/
$ mkdir -p annovar/2014nov12

In the 'annovar' directory make a symbolic link to point to this default version:

$ cd annovar/
$ ln -s 2014nov12 default

Then in the '2014nov12' dir make a file called env.sh file which looks like:

$ cd 2014nov12/
$ cat <
export PATH=/home/galaxy/apps/annovar/2014nov12:$PATH
EOF

(Substitute the directory that you unpacked the ANNOVAR software into in the previous step.) Galaxy will source this file when running the ANNOVAR tool in order to make the underlying programs available.

4. Add the 'annovar_index' data table to the master list of data tables

The ANNOVAR tool gets information about the installed databases from a file called annovar_index.loc. For the version of Galaxy that I'm using there is already a copy of this file in the galaxy-dist/tool-data directory, but the tool won't pick up any databases referenced there until we add the following to the end of the tool_data_table_conf.xml:

    <!-- Location of ANNOVAR databases -->
    <table comment_char="#" name="annovar_indexes">
        <columns>value, dbkey, type, path</columns>
        <file path="tool-data/annovar_index.loc"> </file>
    </table>

Important: this must appear before the closing </tables> tag in the file!

(Note that you will need to restart Galaxy after this step to get it to pick up the data table.)

It's possible that newer versions of Galaxy might not include annovar_index.loc, in which case you'll need to locate the copy that's supplied in the tool itself and copy that to the tool-data directory. The following Linux command (executed from galaxy-dist) should do the trick:

$ find tool-data -name "annovar_index.loc"

5. Install ANNOVAR databases and update the .loc file

At this point the tool is almost set up; it's just missing any actual databases to work with.

The list of available databases can be found here:

http://www.openbioinformatics.org/annovar/annovar_download.html

and they can be downloaded using the annotate_variation.pl script (which is part of ANNOVAR).

It is important to note that the ANNOVAR tool expects all the database files for a specific genome build to be in the same directory.

As an example: say we want to make the hg19 refGene and ensGene databases available in the Galaxy tool. In this case we first download the data:

$ cd /home/galaxy/data/annovar/
$ /home/galaxy/apps/annovar/2014nov12/annotate_variation.pl -downdb -buildver hg19 -webfrom annovar refGene hg19
$ /home/galaxy/apps/annovar/2014nov12/annotate_variation.pl -downdb -buildver hg19 -webfrom annovar ensGene hg19

This will download the files for both databases to a subdirectory called 'hg19' under /home/galaxy/data/annovar/ (you should choose or make your own location as appropriate).

Then update annovar_index.loc to point to these data. The header of the .loc file specifies the format of each line, but essentially each database should be described by a line with four tab-separated fields:

  • Database name: the text that appears for the database within the ANNOVAR tool)
  • Genome build
  • Database type: the ANNOVAR databases are divided into three types: "gene annotations" ('gene_ann'), "annotation regions" ('region') and "annotation databases" ('filter') -  empirically, if a database download contains '.fa.' files it appears to be 'gene_ann', if it contains '.idx' files then it's 'filter'
  • Path to the directory holding the downloaded data files

For the refGene and ensGene example above, for both the genome build is 'hg19', the data are of type 'gene_ann', and the directory holding the files is /home/galaxy/data/annovar/hg19/. So the .loc file entry will look like:

refGene   hg19 gene_ann /home/galaxy/data/annovar/hg19

Finally, you will need to restart Galaxy to refresh the available databases for the ANNOVAR  tool (or if you only have a single Galaxy server running then you can use the option under "Manage local data (beta)" in the "Admin" interface to reload the data).

6. Troubleshooting

It's recommended to run a few example ANNOVAR jobs to check that everything is set up correctly. Some problems that I've encountered in the past include:

#1 The expected databases don't appear as options in the tool

Only databases which match the genome build assigned to the input dataset will be presented as options. Check that the input dataset has been assigned to the correct genome build.

#2 The job produces an empty output file when annotating against a single database

Check the log for the a line like:

convert2annovar.pl: command not found

which suggests that the env.sh file created in step #3 above is not correct.

#3 The job produces an empty output file when annotating against multiple databases

First check for the previous error; if this isn't the case then check the stderr output for a message like:

Error: the required database file ... does not exist.

which suggests a problem with your annovar_index.loc file. Check that the database file does indeed exist, and that all the data files for the genome build are in the same directory (see step #5 above).

2 comments: