Tuesday 2 June 2015

Exposing Galaxy reports via nginx in a production instance

Galaxy includes a report tool that is separate from the main process but which gives lots of potentially useful information about the usage of a Galaxy instance, for example the numbers of jobs that have been run each month, how much disk space each user is currently consuming and so on.

However there doesn't appear to be much documentation about the report tool on the official Galaxy wiki: the most I could find was a rather sparse page at https://wiki.galaxyproject.org/Admin/UsageReports, which gives a very bare bones overview, and doesn't include any information on how it might be exposed in a secure manner in a production environment. Therefore in this post I outline how I've done this for our local Galaxy set up, which uses nginx; however I imagine it could be adapted to work with Apache instead.

1. Set up the report tool to run on localhost

The report tool takes its configuration settings from a file called reports_wsgi.ini, which is located in the config subdirectory of the Galaxy distribution.

Configuring the reports for your local setup is a case of:
  • Making a copy of reports_wsgi.ini.sample called reports_wsgi.ini
  • Editing the database_connection and file_path (if not the default) parameters to match those in your galaxy.ini (or universe_wsgi.ini) file
  • Optionally, editing the port parameter (by default the tool uses port 9001)
  • You should also set the session_secret parameter (a 'salt' used to secure the sessions) if you intend to expose the reports via the web proxy (see section 2 below); a sketch of the resulting file follows this list
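
Putting these together, the edited reports_wsgi.ini might look something like this (a sketch only - the connection string, file path and secret are placeholders to be replaced with your own values):

[server:main]
use = egg:Paste#http
port = 9001
host = 127.0.0.1

[app:main]
database_connection = postgresql://galaxy:password@localhost:5432/galaxy
file_path = database/files
session_secret = some-long-random-string
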
Then you can start the report server using

sh run_reports.sh

and view the reports by pointing a web browser running on the same server to http://127.0.0.1:9001.

If you'd like the report tool to persist between sessions then use

sh run_reports.sh --daemon

to run it as a background process. As with Galaxy itself, use --stop-daemon to halt the background process. (The log file is written to reports_webapp.log if you need to try and debug a problem.)

2. Expose the report tool via nginx

If you're running a production Galaxy and want to be able to access the reports from a browser running on a different machine from your Galaxy server then you could consider using SSH tunnelling, which essentially forwards a port on your local machine to one on the server i.e. port 9001 where the report tool is serving from (see "SSH Tunneling Made Easy" at http://www.revsys.com/writings/quicktips/ssh-tunnel.html for more details of how to do this).
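
For example, something like the following run from your local machine (with the username and hostname adjusted to your own server) forwards local port 9001 to the report tool on the server:

$ ssh -L 9001:127.0.0.1:9001 user@galaxy.example.org

after which the reports can be viewed locally at http://127.0.0.1:9001.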

Alternatively, if you are using a web proxy (as is standard for a production setup) then you could try serving the reports via the proxy as well (in this case nginx). In this example I assume that if Galaxy is being served from e.g. http://galaxy.example.org/ then the reports will be viewed via http://galaxy.example.org/reports/.

First, make the appropriate edits to reports_wsgi.ini: if you have an older Galaxy instance then you'll need to add some sections to the file, specifically:

[filter:proxy-prefix]
use = egg:PasteDeploy#prefix
prefix = /reports

(before the [app:main] section), and

filter-with = proxy-prefix
cookie_path = /reports

(within the [app:main] section).

For more recent Galaxy instances it's simply a case of making sure that the existing filter-with and cookie_path lines are uncommented and set to the values above.

Next it's necessary to add upstream and location sections in your nginx.conf file, along the lines of the following sketch (assuming the report tool is listening on port 9001 as above; the location block goes inside the existing server block for Galaxy):
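
upstream reports {
    server 127.0.0.1:9001;
}

# inside the existing server block for Galaxy:
location /reports {
    proxy_pass http://reports;
    proxy_set_header X-Forwarded-Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}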

(This has many similarities to serving Galaxy itself from a subdirectory via the nginx proxy; see https://wiki.galaxyproject.org/Admin/Config/nginxProxy.)

One important thing to be aware of is that the report tool doesn't include any built-in authentication, so it's recommended that you add some authentication within the web proxy. Otherwise anyone in the world could potentially access the reports for your server and see sensitive information such as user login names.

To do this with nginx, first create a htpasswd file to hold a set of usernames and associated passwords, using the htpasswd utility, e.g.:

htpasswd -c /etc/nginx/galaxy-reports.htpasswd admin

-c means create a new file (in this case /etc/nginx/galaxy-reports.htpasswd); admin is the username to add. The program will prompt for a password for that username and store it in the file. You can use any username, and any filename or location you wish (with the caveat that the file must be readable by the nginx process).

Finally, to associate the password file with the reports location, update the nginx config file by adding two more lines to the location /reports block:
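
# inside the location /reports block added earlier:
auth_basic "Galaxy Reports";
auth_basic_user_file /etc/nginx/galaxy-reports.htpasswd;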

(I found this article very helpful here; note that it also works for https in spite of the title: "How to set up http authentication with nginx on Ubuntu 12.10" https://www.digitalocean.com/community/tutorials/how-to-set-up-http-authentication-with-nginx-on-ubuntu-12-10).

Once nginx has been restarted then anyone attempting to view the reports at http://galaxy.example.org/reports/ will be prompted to enter a username/password combination matching an entry in the htpasswd file before they are given access. Authorised users can then peruse the reports to their heart's content.

Wednesday 22 April 2015

Using GALAXY_SLOTS with multithreaded Galaxy tools

GALAXY_SLOTS is a useful but not particularly well-publicised way of controlling the number of threads Galaxy allocates to a tool that supports multithreaded operation. It's relevant to both Galaxy admins (who need to ensure that multithreaded jobs don't try to consume more resources than they have access to) and to tool developers (who need to know how many threads are available to a tool at runtime).

Having seen various references to GALAXY_SLOTS on the developers' mailing list I'd assumed this was some esoteric feature that I would need to set up in order to use, but in actual fact it's almost embarrassingly simple for most cases. Essentially it can be thought of as an internal variable that Galaxy sets when it starts a job, indicating the number of threads available for that job; a tool can then access this value at runtime in order to make use of that number of threads.

The official documentation can be found here: https://wiki.galaxyproject.org/Admin/Config/GALAXY_SLOTS
and this covers the essential details, but the executive summary is:
  • Tool developers should use GALAXY_SLOTS when specifying the number of threads a tool should run with;
  • Galaxy admins shouldn't need to configure anything unless they're using the local runner, or (possibly) a novel cluster submission system.
And really, that's it. However the following sections give a bit more detail for those who like to have it spelled out (like me).

For tool developers

All that is required for tool developers is to specify GALAXY_SLOTS in the <command> tag in the tool XML wrapper, when setting the number of threads the tool uses.

The syntax for specifying the variable is:

\${GALAXY_SLOTS:-N}

where N is the default value to use if GALAXY_SLOTS is not set. (See the "Tool XML File syntax" documentation for the tag at https://wiki.galaxyproject.org/Admin/Tools/ToolConfigSyntax#A.3Ccommand.3E_tag_set for more details - you need to scroll down to the section on "Reserved Variables" to find it.)

For example, here's a code fragment along the lines of the <command> section in the XML wrapper for a tool to run the Trimmomatic program (a simplified sketch - the actual wrapper has many more options):

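<command>
    java -jar trimmomatic.jar PE -threads \${GALAXY_SLOTS:-6}
    ## ...input, output and trimming options elided...
</command>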

The number of threads defaults to 6 unless GALAXY_SLOTS is explicitly set.

(Aside: the Trimmomatic tool itself can be obtained from the toolshed at https://toolshed.g2.bx.psu.edu/view/pjbriggs/trimmomatic)

For Galaxy Admins

It turns out that generally there is nothing special to do for most cluster systems, although this is not immediately clear from the documentation: in most cases GALAXY_SLOTS is handled automagically and so doesn't require any explicit configuration.

For example for DRMAA (which is what we're using locally), we have job runners defined in our job_conf.xml file along these lines (a sketch - the destination id will vary between sites):

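<destinations>
    <destination id="smp4" runner="drmaa">
        <param id="nativeSpecification">-pe smp.pe 4</param>
    </destination>
</destinations>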

In our set up, -pe smp.pe 4 above requests 4 cores for the job. When using this runner, Galaxy will automagically determine the number of cores from DRMAA (i.e. 4) and set GALAXY_SLOTS to the appropriate value - nothing more to do.

The most obvious exception is the "local" job runner, where you need to explicitly set the number of available slots using the <param id="local_slots"> tag in job_conf.xml; see https://wiki.galaxyproject.org/Admin/Config/Performance/Cluster#Local for more details.
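
For example, a minimal local destination along these lines (a sketch) would limit local jobs to 4 slots:

<destination id="local" runner="local">
    <param id="local_slots">4</param>
</destination>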

Finally, for other job submission systems see the documentation on how to verify that the environment is being set correctly.

Wednesday 18 March 2015

Installing ANNOVAR in Galaxy

ANNOVAR (http://www.openbioinformatics.org/annovar/) is a popular tool used to functionally annotate genetic variants detected from various genomes. The Galaxy toolshed includes a tool called table_annovar which can be used to run ANNOVAR. Installation of the tool into a local Galaxy instance is not fully automated and requires some manual steps which are sketched in the tool's README; this post expands on those basic instructions to hopefully make the process easier for others.

Note that these instructions are for the '2014-02-12' revision of the table_annovar tool (changeset 6:091154194ce8), installing into the latest_2014.08.11 version of galaxy-dist.

1. Install the table_annovar tool from the toolshed

This is the devteam owned tool on the main toolshed:

https://toolshed.g2.bx.psu.edu/view/devteam/table_annovar

and can be installed via the usual admin interface within Galaxy (see for example https://wiki.galaxyproject.org/Admin/Tools/AddToolFromToolShedTutorial).

2. Install the ANNOVAR software

In addition to the Galaxy tool you also need the actual ANNOVAR software. To download a copy you first need to register at:

http://www.openbioinformatics.org/annovar/annovar_download_form.php

Once registered you should receive a link to download the latest version (e.g. annovar-2014nov12.tar.gz). Note that ANNOVAR's licensing conditions prohibit commercial use without a specific agreement, and that users are not permitted to redistribute ANNOVAR to others, including lab members.

Unpack the tar.gz file into a directory where it can be executed by your Galaxy user. For example:

# Make a location for ANNOVAR
$ mkdir -p /home/galaxy/apps/annovar/
# Move into this directory
$ cd /home/galaxy/apps/annovar/
# Unpack the ANNOVAR software
$ tar zxf /path/to/annovar-2014nov12.tar.gz
# Rename the unpacked directory to '2014nov12'
$ mv annovar 2014nov12

This puts the ANNOVAR programs into the directory /home/galaxy/apps/annovar/2014nov12. The actual location isn't so important as long as you know where it is so you can reference it in the next section.

3. Set up the Galaxy environment to make ANNOVAR available to the tool

Essentially we need to manually create the files and directories that Galaxy will use to set the environment appropriately when the ANNOVAR tool is run.

This needs to be done in the directory pointed to by the tool_dependency_dir variable in your Galaxy configuration file (either galaxy.ini or universe_wsgi.ini, depending on the age of your Galaxy distribution) - by default this is ../tool_dependencies (which is relative to your galaxy-dist directory).

Under this directory make a subdirectory for ANNOVAR, for example:

$ cd tool_dependencies/
$ mkdir -p annovar/2014nov12

In the 'annovar' directory make a symbolic link to point to this default version:

$ cd annovar/
$ ln -s 2014nov12 default

Then in the '2014nov12' dir make a file called env.sh which looks like:

$ cd 2014nov12/
$ cat > env.sh <<'EOF'
export PATH=/home/galaxy/apps/annovar/2014nov12:$PATH
EOF

(Substitute the directory that you unpacked the ANNOVAR software into in the previous step.) Galaxy will source this file when running the ANNOVAR tool in order to make the underlying programs available.

4. Add the 'annovar_index' data table to the master list of data tables

The ANNOVAR tool gets information about the installed databases from a file called annovar_index.loc. For the version of Galaxy that I'm using there is already a copy of this file in the galaxy-dist/tool-data directory, but the tool won't pick up any databases referenced there until we add the following to the end of the tool_data_table_conf.xml:

    <!-- Location of ANNOVAR databases -->
    <table comment_char="#" name="annovar_indexes">
        <columns>value, dbkey, type, path</columns>
        <file path="tool-data/annovar_index.loc" />
    </table>

Important: this must appear before the closing </tables> tag in the file!

(Note that you will need to restart Galaxy after this step to get it to pick up the data table.)

It's possible that newer versions of Galaxy might not include annovar_index.loc, in which case you'll need to locate the copy that's supplied in the tool itself and copy that to the tool-data directory. The following Linux command (executed from galaxy-dist) should do the trick:

$ find tool-data -name "annovar_index.loc"

5. Install ANNOVAR databases and update the .loc file

At this point the tool is almost set up; it's just missing any actual databases to work with.

The list of available databases can be found here:

http://www.openbioinformatics.org/annovar/annovar_download.html

and they can be downloaded using the annotate_variation.pl script (which is part of ANNOVAR).

It is important to note that the ANNOVAR tool expects all the database files for a specific genome build to be in the same directory.

As an example: say we want to make the hg19 refGene and ensGene databases available in the Galaxy tool. In this case we first download the data:

$ cd /home/galaxy/data/annovar/
$ /home/galaxy/apps/annovar/2014nov12/annotate_variation.pl -downdb -buildver hg19 -webfrom annovar refGene hg19
$ /home/galaxy/apps/annovar/2014nov12/annotate_variation.pl -downdb -buildver hg19 -webfrom annovar ensGene hg19

This will download the files for both databases to a subdirectory called 'hg19' under /home/galaxy/data/annovar/ (you should choose or make your own location as appropriate).

Then update annovar_index.loc to point to these data. The header of the .loc file specifies the format of each line, but essentially each database should be described by a line with four tab-separated fields:

  • Database name: the text that appears for the database within the ANNOVAR tool
  • Genome build
  • Database type: the ANNOVAR databases are divided into three types: "gene annotations" ('gene_ann'), "annotation regions" ('region') and "annotation databases" ('filter') - empirically, if a database download contains '.fa.' files it appears to be 'gene_ann', and if it contains '.idx' files then it's 'filter'
  • Path to the directory holding the downloaded data files

For the refGene and ensGene example above, the genome build for both is 'hg19', the data are of type 'gene_ann', and the directory holding the files is /home/galaxy/data/annovar/hg19/. So the .loc file entries will look like:

refGene   hg19 gene_ann /home/galaxy/data/annovar/hg19
ensGene   hg19 gene_ann /home/galaxy/data/annovar/hg19

Finally, you will need to restart Galaxy to refresh the available databases for the ANNOVAR tool (or if you only have a single Galaxy server running then you can use the option under "Manage local data (beta)" in the "Admin" interface to reload the data).

6. Troubleshooting

It's recommended to run a few example ANNOVAR jobs to check that everything is set up correctly. Some problems that I've encountered in the past include:

#1 The expected databases don't appear as options in the tool

Only databases which match the genome build assigned to the input dataset will be presented as options. Check that the input dataset has been assigned to the correct genome build.

#2 The job produces an empty output file when annotating against a single database

Check the log for a line like:

convert2annovar.pl: command not found

which suggests that the env.sh file created in step #3 above is not correct.

#3 The job produces an empty output file when annotating against multiple databases

First check for the previous error; if this isn't the case then check the stderr output for a message like:

Error: the required database file ... does not exist.

which suggests a problem with your annovar_index.loc file. Check that the database file does indeed exist, and that all the data files for the genome build are in the same directory (see step #5 above).

Tuesday 17 March 2015

Custom Google web searches for Galaxy help

Google can be a great tool when searching for help with deploying or developing software. However in the specific case of Galaxy, searching the whole of the web can also throw up a lot of unrelated hits (such as astrophysics, or tablet products, to take two examples). I learned recently though that there are now a few custom Google searches available which can help narrow the results.

The full set of searches can be accessed via http://galaxyproject.org/search/, with the most useful for me being the "Galaxy Admin & Development" subsearch at http://galaxyproject.org/search/getgalaxy/

I haven't used them extensively yet but the couple of searches I've tried have been encouragingly relevant. So hopefully these will help me avoid stellar aggregates and mobile phones in future.

Thursday 12 February 2015

FTP upload to Galaxy using ProFTPd and PBKDF2

Recently I've been looking at enabling FTP upload for a local instance of Galaxy (based on the latest_2014.08.11 version of galaxy-dist.)

The Galaxy documentation for integrating an FTP upload server can be found at https://wiki.galaxyproject.org/Admin/Config/UploadviaFTP, and I've been working with their recommended choice of ProFTPd. Overall the instructions are pretty straightforward, but I've encountered a few issues, mostly to do with Galaxy's change from SHA1 to PBKDF2 as its default choice of password authentication. This post details how I handled these to get the upload working.

Note that I'm assuming that Galaxy is using Postgres as its database engine.

1. Get ProFTPd

ProFTPd is simple to install on Scientific Linux 6 via yum:
  • yum install proftpd proftpd-postgresql
with proftpd-postgresql providing the extensions required by ProFTPd to access PostgreSQL databases.

If you need to build ProFTPd manually from source (for example because the default version doesn't have features that you need, such as handling PBKDF2 password encryption - see below) then download the code from the ProFTPd website and do e.g.:

# yum install postgresql-devel openssl-devel
# tar zvxf proftpd-1.3.5.tar.gz
# cd proftpd-1.3.5
# ./configure --prefix=/opt/apps/proftpd/1.3.5 --disable-auth-file --disable-ncurses --disable-ident --disable-shadow --enable-openssl --with-modules=mod_sql:mod_sql_postgres:mod_sql_passwd
# make ; make install

Note that the final step must be performed with superuser privileges.

2. Check how your Galaxy installation handles password encryption

Galaxy appears to support two types of password encryption: older versions of Galaxy use SHA1 to encrypt their passwords, whereas newer versions use a more sophisticated protocol called PBKDF2.

If you're using SHA1 then configuring ProFTPd is pretty straightforward, and the instructions on the Galaxy wiki should work out of the box. If you're using PBKDF2 then the configuration is a little more involved.

You can configure Galaxy to explicitly revert to SHA1 by setting use_pbkdf2 = False in the configuration files, in the [app:main] section; however by default PBKDF2 is used, and this blog post assumes that this is the case.

3. (Optionally) Set up a database user specifically for FTP authentication

This is not critical but is recommended. When users try to upload files to the FTP server they will log in using their Galaxy username and password. In order to enable this ProFTPd needs to be able to query Galaxy's database to check these credentials, and doing it via a database user with limited privileges (essentially only SELECT on the galaxy_user table) is more secure than via the one that Galaxy itself uses.

For Postgresql the instructions given on the Galaxy wiki are fine.
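
For reference, the gist of it is something like the following (a sketch with placeholder names - see the wiki for the authoritative version):

CREATE USER galaxyftp WITH PASSWORD 'ftppassword';
GRANT SELECT ON galaxy_user TO galaxyftp;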

4. Create an area where ProFTPd will put uploaded files (and point Galaxy to it)

This should be a directory on the system which is readable by the Galaxy user. The ftp_upload_dir parameter in the Galaxy config file should be set to point to this location.

(It appears that you also need to set a value for ftp_upload_site in order for the uploaded files to be presented to the user when they go to "Upload Files"; a sketch of both settings is below.)
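
The relevant lines in the Galaxy config file might look something like this (a sketch - the path and hostname are placeholders):

ftp_upload_dir = /home/galaxy/ftp
ftp_upload_site = ftp://galaxy.example.org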

5. Configure ProFTPd

ProFTPd's default configuration file is located in /etc/proftpd.conf (if using the default system installation), or otherwise in the etc subdirectory where you installed ProFTPd if you built your own.

5.1 Configuring ProFTPd to use SHA1 password authentication

The Galaxy documentation gives an example ProFTPd config file that should work for the old SHA1 password encryption. I don't cover using SHA1 any further in this post.

5.2 Configuring ProFTPd to use PBKDF2 password authentication

As this is not documented on the Galaxy wiki, I used a sample ProFTPd configuration posted by Ricardo Perez in this thread from the Galaxy Developers mailing list as a starting point: http://dev.list.galaxyproject.org/ProFTPD-integration-with-Galaxy-td4660295.html - his example was invaluable to me for getting this working.

Here's a sketch of the key settings from the ProFTPd conf file that I created to enable PBKDF2 authentication (the database connection details below are placeholders, and the substring offsets in the SQL queries depend on the exact format of the password strings in your galaxy_user table, so do check them against your own database):

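# Sketch of the PBKDF2-specific settings. The directives are real
# mod_sql/mod_sql_passwd directives, but the connection details are
# placeholders, and the substring offsets assume Galaxy password strings
# of the form PBKDF2$sha256$10000$<12-char salt>$<base64 hash> - check
# these against your own galaxy_user table (and Ricardo's posting).
SQLEngine           on
SQLBackend          postgres
SQLConnectInfo      galaxy@localhost galaxyftp ftppassword
SQLAuthenticate     users

SQLAuthTypes        PBKDF2
SQLPasswordPBKDF2   sha256 10000 24
SQLPasswordEncoding base64

# Pull the salt out of Galaxy's stored password string
SQLPasswordUserSalt sql:/GetUserSalt
SQLNamedQuery       GetUserSalt SELECT "substring(password from 21 for 12) FROM galaxy_user WHERE email='%U'"

# Look up the user: return the hash part of the password, plus the UID/GID
# of the user running Galaxy (512 here) and the FTP upload directory
SQLUserInfo         custom:/LookupGalaxyUser
SQLNamedQuery       LookupGalaxyUser SELECT "email, substring(password from 34 for 32) AS password2, 512, 512, '/home/galaxy/ftp/%U', '/bin/sh' FROM galaxy_user WHERE email='%U'"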

Note that the SQLPasswordPBKDF2 directive is not available in ProFTPd before version 1.3.5rc3, so check which version you're using.

(It should also be possible to configure ProFTPd to use both SHA1 and PBKDF2 authentication, and there are hints on how to do this in Ricardo's message linked above. However I haven't tried implementing it yet.)

6. Test your ProFTPd settings

ProFTPd can be run as a system service but during initial setup and debugging I found it useful to run directly from a console. In particular:
  • proftpd --config /path/to/conf_file -t performs basic checks on the conf file and warns if there are any syntax errors or other problems
  • proftpd --config /path/to/conf_file -n starts the server in "no daemon" mode
  • proftpd --config /path/to/conf_file -n -d 10 runs in debugging mode with maximal output, which is useful for diagnosing problems with the authentication.
On our system I also needed to update the Shorewall firewall settings to allow external access to port 21 (the default port used by FTP services), by editing the /etc/shorewall/rules file.

You can then test by ftp'ing to the server and checking that you can log in using your Galaxy credentials, upload a file, see that it appears in the correct place on the filesystem with the correct file ownership and permissions (it should be read/writeable by the user running the Galaxy process), and check that Galaxy's upload tool presents it as an option.

If any of these steps fail then running ProFTPd with the debugging option can be really helpful in understanding what's happening behind the scenes.

One other gotcha is that if the Galaxy user UID or GID is less than 999, then you will need to set SQLMinID (or similar) in the ProFTPd conf file to a suitable value, otherwise the uploaded files will not be assigned to the correct user (you can get the UID/GID using the "id" command).

7. Make ProFTPd run as a service

If everything appears to be working then you can set up ProFTPd to run as a system service - if you're using the system installed version then there should already be an /etc/init.d/proftpd file to allow you to do

service proftpd start

Otherwise you will need to make your own init.d script for ProFTPd - I used the one in the documentation at http://www.proftpd.org/docs/howto/Stopping.html as a starting point, put it into /etc/init.d/ and edited the FTPD_BIN and FTPD_CONF variables to point to the appropriate files for my installation.

Once this is done you should have FTP uploads working with Galaxy using PBKDF2 password authentication.

Updates: fixed typos in the name of "PBKDF2" and clarified that SHA1 is not used (27/02/2015).

Welcome

I'm starting this blog principally as an on-going way to document what I've learned about working with Penn State's Galaxy platform for data intensive biomedical research. Part of my current role involves supporting a number of Galaxy instances at the University of Manchester, which includes the set up, maintenance and administration of Galaxy for teaching and research. I also do some development of new tools within the Galaxy framework, in collaboration with local bioinformaticians and researchers. For want of a better term I am a "galactic engineer"; and given this focus I expect that my posts will be more technical than scientific.

I love Galaxy: it's a great platform supported by talented developers and a friendly and active community, and we have benefited from deploying it locally both for research and training. However I find it can sometimes be tricky to keep up with "best practice" as Galaxy continues to evolve - especially given the limited time I have to spend on it alongside other things. So while this blog is primarily for my own benefit (and the information is likely to go out of date over time) I hope that it might still be of use to others out there like me in their own efforts to set up and use Galaxy.

Finally, a disclaimer: all opinions, mistakes and misapprehensions are my own!