Thursday, 30 August 2018

Mercurial-based tool installation issues in Galaxy 18.05

Recently I've encountered a subtle problem with tool installation after upgrading our local production instances to Galaxy release 18.05, which I'd like to document here in case it comes up again in future.

The problem manifests itself when attempting to install a tool from the main toolshed via the admin interface: after clicking Install, the tool installation status goes almost immediately to Error. Further inspection reveals that the tool repository hasn't been cloned to the local filesystem, and no dependencies are installed.

Frustratingly, this failure leaves no error messages in the logs that might help to diagnose the cause. However, attempting the install via the Galaxy API (using nebulizer) did return an error message:

Error cloning repository: [Errno 2] No such file or directory

I was able to track this down to the clone_repository function in lib/tool_shed/util/hg_util.py, which raises this error when something goes wrong with the hg clone ... command used in the tool installation process. hg is the Mercurial version control command, and essentially the problem was that Galaxy couldn't find this command.
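The "[Errno 2]" in that message is the POSIX ENOENT error you get when trying to execute a program that doesn't exist. You can reproduce the same symptom from a shell (the command name below is deliberately bogus, standing in for a missing hg):

```shell
# Run a command that isn't on the PATH; env reports the underlying ENOENT
# error ("No such file or directory") and exits with status 127.
env hg-definitely-not-installed clone https://example.org/repo
echo "exit status: $?"
```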

Our local Galaxy installations are configured to use supervisor with uWSGI, with the Galaxy dependencies installed into a Python virtualenv. Since this virtualenv included Mercurial, I wondered why hg wasn't being picked up from there for the tool installation process.

Marius van den Beek offered some helpful insights via the galaxy-dev mailing list which clarified the situation:
Recent galaxy releases are using the `hg` command that should be automatically installed along with other galaxy dependencies.
If you're running galaxy in a virtualenv then that virtualenv should have the `hg` script in the bin folder.
Depending on how you start galaxy you may need to add the virtualenv's `bin` folder to the `PATH`.
Based on this it turned out that I needed to add an 'environment' parameter to the supervisor.ini file for Galaxy, to specify the virtualenv to use and add its bin directory to the PATH - something like:

environment = VIRTUAL_ENV="/srv/galaxy/venv",PATH="/srv/galaxy/venv/bin:%(ENV_PATH)s"

(This parameter is mentioned in the installation documentation, in the Scaling and Load Balancing section, but only for configuring handler processes. However since our instances are using the uwsgi + mules strategy, it didn't occur to me that it would still be needed.)
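For context, here's a sketch of what the relevant supervisor program section might look like with this parameter in place. The paths, program name and uWSGI invocation are illustrative rather than taken from our actual config:

```ini
[program:galaxy]
command     = /srv/galaxy/venv/bin/uwsgi --yaml /srv/galaxy/config/galaxy.yml
directory   = /srv/galaxy/server
user        = galaxy
autostart   = true
autorestart = true
; Make the virtualenv's hg (and everything else in its bin) visible to Galaxy
environment = VIRTUAL_ENV="/srv/galaxy/venv",PATH="/srv/galaxy/venv/bin:%(ENV_PATH)s"
```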

Restarting Galaxy with the updated supervisor.ini file enabled tool installation to work without problems again.

Some closing asides:

  • The problem can be masked if Mercurial is installed elsewhere on the system and is on the Galaxy user's PATH (for example /usr/bin/hg)
  • If there is a system version but it is very old (for example Scientific Linux 7 has Mercurial 1.7) then it can cause a slightly different error in clone_repository, but the outcome and fix should be the same as above
  • Since first encountering this issue I've come across a strange variant, whereby Mercurial is installed in the Galaxy virtualenv and supervisor is correctly configured but the tool installations still fail immediately. In this case for some unknown reason it turned out that the hg script in the virtualenv wasn't executable - adding 'execute' permission fixed this one.
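As a quick way to check for that last variant, you can test the permissions on the hg script directly. The snippet below simulates the broken state and the fix in a throwaway directory (the real path would be something like /srv/galaxy/venv/bin/hg):

```shell
# Simulate a virtualenv bin directory containing a non-executable hg script
venv=$(mktemp -d)
mkdir -p "$venv/bin"
printf '#!/bin/sh\necho "hg stub"\n' > "$venv/bin/hg"
chmod 644 "$venv/bin/hg"          # present, but no execute permission

[ -x "$venv/bin/hg" ] || echo "hg is not executable"
chmod +x "$venv/bin/hg"           # the fix: add execute permission
[ -x "$venv/bin/hg" ] && echo "hg is now executable"
```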

Tuesday, 25 April 2017

Securing Galaxy with HTTPS running with Nginx using Let’s Encrypt

Background

To secure communication between a Galaxy instance and its users it is best to enable HTTPS on the Galaxy web server, to ensure that all data transmissions between Galaxy and the end user (including sensitive information such as usernames and passwords) are encrypted. This can be done by obtaining and installing SSL/TLS certificates on the server.

The simplest approach in the past was to use self-signed certificates as a way to enable HTTPS while avoiding the cost of purchasing certificates from a commercial Certificate Authority (CA) (for example by using the make-dummy-certs utility found in e.g. /usr/ssl/certs). The downside of this approach is that when a user first tries to access the server their web browser will complain that the certificates are not trusted, and they will typically have to create a one-off security exception before they can access the Galaxy service.

More recently however, a free Certificate Authority called Let's Encrypt (https://letsencrypt.org/) has been set up, which issues certificates as part of its stated mission to “secure the web”. This blog post gives an overview of how we obtained and installed certificates from Let's Encrypt to enable HTTPS for our production Galaxy instances, using their automated cert-bot client utility.

Before beginning

The procedure described below uses the 'webroot' plugin of cert-bot (see https://certbot.eff.org/docs/using.html#webroot), which is a general method recommended for obtaining certificates for web servers running nginx. cert-bot also has a plugin for nginx but at the time of writing this is still at alpha-release stage, so I didn't use it for our Galaxy servers (see https://certbot.eff.org/docs/using.html#nginx for more details).

For Apache-based servers you can use a dedicated plugin described at https://certbot.eff.org/docs/using.html#apache, which offers a more automated procedure than the one described here.

Also, although it targets a different operating system to ours and while many of the details are now out-of-date, DigitalOcean's how-to guide at https://www.digitalocean.com/community/tutorials/how-to-secure-nginx-with-let-s-encrypt-on-ubuntu-14-04 is still a useful resource and was immensely helpful to me for understanding the overall process.

Finally, please note that the procedure and its details are likely to change over time. Make sure you check the documentation before carrying out any of these operations on your own infrastructure!

Step 1: Install cert-bot (Let’s Encrypt client) on the server

To begin you need to ensure the Let's Encrypt cert-bot utility (https://certbot.eff.org/) is available on the server, to perform the job of obtaining and installing the certificates.

The documentation recommends that if possible you should use the cert-bot package provided by the package manager for your system (e.g. yum, apt etc). However if one isn't available (or is unsuitable e.g. because it's out-of-date) then you can install the client using the certbot-auto wrapper script instead (see https://certbot.eff.org/docs/install.html#certbot-auto). This is the approach I used, putting certbot-auto into /usr/local/bin on the server running Galaxy and nginx.

(Note that certbot-auto takes the same arguments as the cert-bot utility; the only difference is that, if necessary, it will download and update itself first each time it's run.)

  • Aside: there is also a cert-bot package available via the Python Package Index (PyPI). When I first performed this procedure I noted that the documentation emphasised that cert-bot should not be installed via 'pip install', but now I can't find any reference to this. However I would still avoid installing from PyPI for the time being.

Step 2: Get certificates using the 'webroot' method

cert-bot provides a number of different ways to obtain certificates depending on the webserver software being used. The 'webroot' protocol used here is less automated than some of the other procedures but is still quite straightforward, and works by placing a special file on your webserver which Let's Encrypt can attempt to fetch in order to verify the server name and details that are supplied when the cert-bot client is run.

First we need to set up a special directory called .well-known, where Let's Encrypt will place its file:
  • Create a directory called .well-known in the document root of the server (the default for nginx is /usr/share/nginx/html but the actual path can be found by looking up the value of webroot-path in the server configuration), e.g.:

    mkdir /usr/share/nginx/html/.well-known

    Optionally also add a dummy index file to help check that the directory is visible via a web browser later, e.g.:

    cat >/usr/share/nginx/html/.well-known/index.html <<EOF
    Hello world!
    EOF
  • Add a new location block inside the server block in the nginx configuration file, to allow access to the .well-known directory:

    location ~ /.well-known {
        allow all;
    }
  • Restart nginx and check that the .well-known directory is visible (e.g. by pointing a web browser at it)
Then we need to run certbot-auto (or cert-bot) interactively to generate and install the certificates:
  • sudo certbot-auto certonly --webroot -w /usr/share/nginx/html -d MYDOMAIN
where MYDOMAIN is the domain name of your Galaxy server (e.g. "palfinder.ls.manchester.ac.uk").

  • Aside: note that this bootstraps certbot, including checking for the system packages that it requires; you'll be prompted to install any that it thinks are missing via the system package manager e.g. yum.

certbot will then prompt you to agree to Let's Encrypt's terms and conditions and ask you to provide an email address which will be used for notices and for lost key recovery.

If all goes well then this should produce a set of certificate files under /etc/letsencrypt/archive (with links to these from /etc/letsencrypt/live/):
  • cert.pem (your domain's certificate)
  • chain.pem (the Let's Encrypt chain certificate)
  • fullchain.pem (cert.pem and chain.pem combined)
  • privkey.pem (your certificate's private key)
  • IMPORTANT: you should ensure that the certificate files are backed up to a secure and safe location!
Step 3: configure TLS/SSL on nginx using the certificates

To enable HTTPS we need to configure nginx to listen on port 443 with SSL enabled, and to use the certificates from Let's Encrypt. This is done by adding the following to the server block in the nginx configuration file, for example:

server {
    listen 443 ssl;
    server_name MYDOMAIN;
    ssl_certificate /etc/letsencrypt/live/MYDOMAIN/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/MYDOMAIN/privkey.pem;
}

(Again, your actual domain name should be substituted for MYDOMAIN above.)

It's also a good idea to block or redirect HTTP traffic, so that users don't accidentally send data via an insecure connection - for example to redirect all HTTP requests to HTTPS:

server {
    listen 80 default;
    server_name MYDOMAIN;
    rewrite ^ https://$server_name$request_uri? permanent;
}

Once nginx is restarted you can check using your browser that HTTPS is working for your Galaxy instance; you can also use the Qualys SSL Labs website to check your server configuration:

https://www.ssllabs.com/ssltest/index.html

(NB this is useful for flagging up other issues which you might wish to address!)

Step 4: set up automated certificate renewal

Finally: since all certificates issued by Let's Encrypt expire after 90 days, Let's Encrypt recommends renewing them at least once every 3 months.

It's straightforward to automate this process by setting up a cron job on the server to run cert-bot or certbot-auto's 'renew' command (which will renew any previously-obtained certificates that are due to expire in less than 30 days) and then restart nginx (so that any renewed certificates will be loaded).

For example I have the following commands in the root crontab on our server:

# Check SSL certificate renewal from Let's Encrypt

30 2 * * 1 /usr/local/bin/certbot-auto renew >> /var/log/le-renew
# Restart nginx after SSL certificate renewal
35 2 * * 1 service nginx restart >> /var/log/le-renew

See the documentation at https://certbot.eff.org/docs/using.html#renewing-certificates for more information on certificate renewal.

Update 22nd October 2018: the original crontab lines above didn't work for me - the certificate renewals would fail and have to be performed manually, resulting in downtime for the period when nginx would no longer have valid SSL certificates.

Since then I've replaced the original crontab lines with the following single line:

30 2 * * 1 /usr/local/bin/certbot-auto renew --deploy-hook "/sbin/service nginx reload" >> /var/log/le-renew


which uses certbot-auto's --deploy-hook option to reload the nginx configuration on successful certificate renewal via the service command. Note that the full path to service is required as cron jobs have a minimal PATH which doesn't seem to include /sbin.

Tuesday, 2 June 2015

Exposing Galaxy reports via nginx in a production instance

Galaxy includes a report tool that is separate from the main process but which gives lots of potentially useful information about the usage of a Galaxy instance, for example the numbers of jobs that have been run each month, how much disk space each user is currently consuming and so on.

However there doesn't appear to be much documentation about the report tool on the official Galaxy wiki: the most I could find was a rather sparse page at https://wiki.galaxyproject.org/Admin/UsageReports, which gives a very bare-bones overview and doesn't include any information on how it might be exposed in a secure manner in a production environment. Therefore in this post I outline how I've done this for our local Galaxy setup, which uses nginx; however I imagine it could be adapted to work with Apache instead.

1. Set up the report tool to run on localhost

The report tool takes its configuration settings from a file called reports_wsgi.ini, which is located in the config subdirectory of the Galaxy distribution.

Configuring the reports for your local setup is a case of:
  • Making a copy of reports_wsgi.ini.sample called reports_wsgi.ini
  • Editing the database_connection and file_path (if not the default) parameters to match those in your galaxy.ini (or universe_wsgi.ini) file
  • Optionally, editing the port parameter (by default the tool uses port 9001)
  • Also setting the session_secret parameter (a 'salt' value), which you should do if you intend to expose the reports via the web proxy (see below)
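Pulled together, the edited parts of reports_wsgi.ini might look something like this sketch (the connection string and secret are placeholder values, and the section names assume a reasonably recent sample file):

```ini
[server:main]
; Port the report tool listens on (9001 is the default)
port = 9001

[app:main]
; These must match the values in your galaxy.ini (or universe_wsgi.ini)
database_connection = postgresql://galaxyuser:password@localhost/galaxydb
file_path = database/files
; Needed if exposing the reports via the web proxy (see below)
session_secret = REPLACE_WITH_A_RANDOM_STRING
```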
Then you can start the report server using

sh run_reports.sh

and view the reports by pointing a web browser running on the same server to http://127.0.0.1:9001.

If you'd like the report tool to persist between sessions then use

sh run_reports.sh --daemon

to run it as a background process. As with Galaxy itself, use --stop-daemon to halt the background process. (The log file is written to reports_webapp.log if you need to try and debug a problem.)

2. Expose the report tool via nginx

If you're running a production Galaxy and want to be able to access the reports from a browser running on a different machine to your Galaxy server then you could consider using SSH tunnelling, which essentially forwards a port on your local machine to one on the server i.e. port 9001 where the report tool is serving from (see "SSH Tunneling Made Easy" at http://www.revsys.com/writings/quicktips/ssh-tunnel.html for more details of how to do this).

Alternatively if you are using a web proxy (as is standard for a production setup) then you could try serving the reports also via the proxy (in this case nginx). In this example I assume that if Galaxy is being served from e.g. http://galaxy.example.org/ then the reports will be viewed via http://galaxy.example.org/reports/.

First, make the appropriate edits to reports_wsgi.ini: if you have an older Galaxy instance then you'll need to add some sections to the file, specifically:

[filter:proxy-prefix]
use = egg:PasteDeploy#prefix
prefix = /reports

(before the [app:main] section), and

filter-with = proxy-prefix
cookie_path = /reports

(within the [app:main] section.)

For more recent Galaxy instances it's simply a case of making sure that the existing filter-with and cookie_path lines are uncommented and set to the values above.

Next it's necessary to add upstream and location sections in your nginx.conf file:

(This has many similarities to serving Galaxy itself from a subdirectory via the nginx proxy; see https://wiki.galaxyproject.org/Admin/Config/nginxProxy.)
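In sketch form, the additions look something like this (the upstream name is arbitrary, and the port must match the one set in reports_wsgi.ini):

```nginx
http {
    # Proxy target: the report tool listening on localhost
    upstream reports_app {
        server 127.0.0.1:9001;
    }

    server {
        # ... existing Galaxy proxy configuration ...

        location /reports/ {
            proxy_pass http://reports_app;
            proxy_set_header X-Forwarded-Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }
}
```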

One important thing to be aware of is that the report tool doesn't include any built-in authentication, so it's recommended that you add some authentication within the web proxy. Otherwise anyone in the world could potentially access the reports for your server and see sensitive information such as user login names.

To do this with nginx, first create a htpasswd file to hold a set of user names and associated passwords, using the htpasswd utility, e.g.:

htpasswd -c /etc/nginx/galaxy-reports.htpasswd admin

-c means create a new file (in this case /etc/nginx/galaxy-reports.htpasswd); admin is the username to add. The program will prompt for a password for that username, and store it in the file. You can use any username, and any filename or location (with the caveat that it must be readable by the nginx process) that you wish.

Finally, to associate the password file with the reports location, update the nginx config file by adding two more lines:

(I found this article very helpful here; note that it also works for https in spite of the title: "How to set up http authentication with nginx on Ubuntu 12.10" https://www.digitalocean.com/community/tutorials/how-to-set-up-http-authentication-with-nginx-on-ubuntu-12-10).
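In sketch form, the reports location block gains the two auth_basic directives like so (the realm string is arbitrary):

```nginx
location /reports/ {
    proxy_pass http://reports_app;
    # Require a valid username/password from the htpasswd file
    auth_basic "Galaxy reports";
    auth_basic_user_file /etc/nginx/galaxy-reports.htpasswd;
}
```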

Once nginx has been restarted then anyone attempting to view the reports at http://galaxy.example.org/reports/ will be prompted to enter a username/password combination matching an entry in the htpasswd file before they are given access. Authorised users can then peruse the reports to their heart's content.

Wednesday, 22 April 2015

Using GALAXY_SLOTS with multithreaded Galaxy tools

GALAXY_SLOTS is a useful but not particularly well-publicised way of controlling the number of threads Galaxy allocates to a tool that supports multithreaded operation. It's relevant to both Galaxy admins (who need to ensure that multithreaded jobs don't try to consume more resources than they have access to) and to tool developers (who need to know how many threads are available to a tool at runtime).

Having seen various references to GALAXY_SLOTS on the developer's mailing list I'd assumed this was some esoteric feature that I would need to set up to use, but in actual fact it's almost embarrassingly simple for most cases. Essentially it can be thought of as an internal variable that's set by Galaxy when it starts a job, which indicates the number of threads that are available for that job and which can subsequently be accessed by a tool in order to make use of that number of threads.

The official documentation can be found here: https://wiki.galaxyproject.org/Admin/Config/GALAXY_SLOTS
and this covers the essential details, but the executive summary is:
  • Tool developers should use GALAXY_SLOTS when specifying the number of threads a tool should run with;
  • Galaxy admins shouldn't need to configure anything unless they're using the local runner, or (possibly) a novel cluster submission system.
And really, that's it. However the following sections give a bit more detail for those who like to have it spelled out (like me).

For tool developers

All that is required for tool developers is to specify GALAXY_SLOTS in the <command> tag in the tool XML wrapper, when setting the number of threads the tool uses.

The syntax for specifying the variable is:

\${GALAXY_SLOTS:-N}

where N is the default value to use if GALAXY_SLOTS is not set. (See the "Tool XML File syntax" documentation for the tag at https://wiki.galaxyproject.org/Admin/Tools/ToolConfigSyntax#A.3Ccommand.3E_tag_set for more details - you need to scroll down to the section on "Reserved Variables" to find it.)

For example, here's a code fragment from the XML wrapper from a tool to run the Trimmomatic program:
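(This is a sketch of the relevant part rather than the verbatim wrapper - the java invocation is abbreviated, and only the -threads option matters here:)

```xml
<command>
    java -jar trimmomatic.jar PE -threads \${GALAXY_SLOTS:-6} ...
</command>
```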


The number of threads defaults to 6 unless GALAXY_SLOTS is explicitly set.

(Aside: the Trimmomatic tool itself can be obtained from the toolshed at https://toolshed.g2.bx.psu.edu/view/pjbriggs/trimmomatic)

For Galaxy Admins

It turns out that generally there is nothing special to do for most cluster systems, although this is not immediately clear from the documentation: in most cases GALAXY_SLOTS is handled automagically and so doesn't require any explicit configuration.

For example for DRMAA (which is what we're using locally), we have job runners defined in our job_conf.xml file like:
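(The following is a sketch along those lines rather than our exact file - the plugin and destination IDs are illustrative:)

```xml
<plugins>
    <plugin id="drmaa" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner"/>
</plugins>
<destinations default="smp4">
    <destination id="smp4" runner="drmaa">
        <!-- Request 4 cores from the scheduler; Galaxy sets GALAXY_SLOTS to match -->
        <param id="nativeSpecification">-pe smp.pe 4</param>
    </destination>
</destinations>
```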


In our set up, -pe smp.pe 4 above requests 4 cores for the job. When using this runner, Galaxy will automagically determine the number of cores from DRMAA (i.e. 4) and set GALAXY_SLOTS to the appropriate value - nothing more to do.

The most obvious exception is the "local" job runner, where you need to explicitly set the number of available slots using the <param id="local_slots"> tag in job_conf.xml; see https://wiki.galaxyproject.org/Admin/Config/Performance/Cluster#Local for more details.

Finally, for other job submission systems see the documentation on how to verify that the environment is being set correctly.

Wednesday, 18 March 2015

Installing ANNOVAR in Galaxy

ANNOVAR (http://www.openbioinformatics.org/annovar/) is a popular tool used to functionally annotate genetic variants detected from various genomes. The Galaxy toolshed includes a tool called table_annovar which can be used to run ANNOVAR. Installation of the tool into a local Galaxy instance is not fully automated and requires some manual steps which are sketched in the tool's README; this post expands on those basic instructions to hopefully make the process easier for others.

Note that these instructions are for the '2014-02-12' revision of the table_annovar tool (changeset 6:091154194ce8), installing into the latest_2014.08.11 version of galaxy-dist.

1. Install the table_annovar tool from the toolshed

This is the devteam owned tool on the main toolshed:

https://toolshed.g2.bx.psu.edu/view/devteam/table_annovar

and can be installed via the usual admin interface within Galaxy (see for example https://wiki.galaxyproject.org/Admin/Tools/AddToolFromToolShedTutorial).

2. Install the ANNOVAR software

In addition to the Galaxy tool you also need the actual ANNOVAR software. To download a copy you first need to register at:

http://www.openbioinformatics.org/annovar/annovar_download_form.php

Once registered you should receive a link to download the latest version (e.g. annovar-2014nov12.tar.gz). Note that ANNOVAR's licensing conditions prohibit commercial use without a specific agreement, and that users are not permitted to redistribute ANNOVAR to others, including lab members.

Unpack the tar.gz file into a directory where it can be executed by your Galaxy user. For example:

# Make a location for ANNOVAR
$ mkdir -p /home/galaxy/apps/annovar/
# Move into this directory
$ cd /home/galaxy/apps/annovar/
# Unpack the ANNOVAR software
$ tar zxf /path/to/annovar-2014nov12.tar.gz
# Rename the unpacked directory to '2014nov12'
$ mv annovar 2014nov12

This puts the ANNOVAR programs into the directory /home/galaxy/apps/annovar/2014nov12. The actual location isn't so important as long as you know where it is so you can reference it in the next section.

3. Set up the Galaxy environment to make ANNOVAR available to the tool

Essentially we need to manually create the files and directories that Galaxy will use to set the environment appropriately when the ANNOVAR tool is run.

This needs to be done in the directory pointed to by the tool_dependency_dir variable in your Galaxy configuration file (either galaxy.ini or universe_wsgi.ini, depending on the age of your Galaxy distribution) - by default this is ../tool_dependencies (which is relative to your galaxy-dist directory).

Under this directory make a subdirectory for ANNOVAR, for example:

$ cd tool_dependencies/
$ mkdir -p annovar/2014nov12

In the 'annovar' directory make a symbolic link to point to this default version:

$ cd annovar/
$ ln -s 2014nov12 default

Then in the '2014nov12' dir make a file called env.sh which looks like:

$ cd 2014nov12/
$ cat >env.sh <<'EOF'
export PATH=/home/galaxy/apps/annovar/2014nov12:$PATH
EOF

(Substitute the directory that you unpacked the ANNOVAR software into in the previous step.) Galaxy will source this file when running the ANNOVAR tool in order to make the underlying programs available.

4. Add the 'annovar_index' data table to the master list of data tables

The ANNOVAR tool gets information about the installed databases from a file called annovar_index.loc. For the version of Galaxy that I'm using there is already a copy of this file in the galaxy-dist/tool-data directory, but the tool won't pick up any databases referenced there until we add the following to tool_data_table_conf.xml:

    <!-- Location of ANNOVAR databases -->
    <table comment_char="#" name="annovar_indexes">
        <columns>value, dbkey, type, path</columns>
        <file path="tool-data/annovar_index.loc"> </file>
    </table>

Important: this must appear before the closing </tables> tag in the file!

(Note that you will need to restart Galaxy after this step to get it to pick up the data table.)

It's possible that newer versions of Galaxy might not include annovar_index.loc, in which case you'll need to locate the copy that's supplied in the tool itself and copy that to the tool-data directory. The following Linux command (executed from galaxy-dist) should do the trick:

$ find tool-data -name "annovar_index.loc"

5. Install ANNOVAR databases and update the .loc file

At this point the tool is almost set up; it's just missing any actual databases to work with.

The list of available databases can be found here:

http://www.openbioinformatics.org/annovar/annovar_download.html

and they can be downloaded using the annotate_variation.pl script (which is part of ANNOVAR).

It is important to note that the ANNOVAR tool expects all the database files for a specific genome build to be in the same directory.

As an example: say we want to make the hg19 refGene and ensGene databases available in the Galaxy tool. In this case we first download the data:

$ cd /home/galaxy/data/annovar/
$ /home/galaxy/apps/annovar/2014nov12/annotate_variation.pl -downdb -buildver hg19 -webfrom annovar refGene hg19
$ /home/galaxy/apps/annovar/2014nov12/annotate_variation.pl -downdb -buildver hg19 -webfrom annovar ensGene hg19

This will download the files for both databases to a subdirectory called 'hg19' under /home/galaxy/data/annovar/ (you should choose or make your own location as appropriate).

Then update annovar_index.loc to point to these data. The header of the .loc file specifies the format of each line, but essentially each database should be described by a line with four tab-separated fields:

  • Database name: the text that appears for the database within the ANNOVAR tool
  • Genome build
  • Database type: the ANNOVAR databases are divided into three types: "gene annotations" ('gene_ann'), "annotation regions" ('region') and "annotation databases" ('filter') - empirically, if a database download contains '.fa' files it appears to be 'gene_ann', and if it contains '.idx' files then it's 'filter'
  • Path to the directory holding the downloaded data files

For the refGene and ensGene example above, for both databases the genome build is 'hg19', the data are of type 'gene_ann', and the directory holding the files is /home/galaxy/data/annovar/hg19/. So the refGene .loc file entry (with tab-separated fields) will look like:

refGene   hg19 gene_ann /home/galaxy/data/annovar/hg19

Finally, you will need to restart Galaxy to refresh the available databases for the ANNOVAR tool (or, if you only have a single Galaxy server running, you can use the option under "Manage local data (beta)" in the "Admin" interface to reload the data).

6. Troubleshooting

It's recommended to run a few example ANNOVAR jobs to check that everything is set up correctly. Some problems that I've encountered in the past include:

#1 The expected databases don't appear as options in the tool

Only databases which match the genome build assigned to the input dataset will be presented as options. Check that the input dataset has been assigned to the correct genome build.

#2 The job produces an empty output file when annotating against a single database

Check the log for a line like:

convert2annovar.pl: command not found

which suggests that the env.sh file created in step #3 above is not correct.

#3 The job produces an empty output file when annotating against multiple databases

First check for the previous error; if this isn't the case then check the stderr output for a message like:

Error: the required database file ... does not exist.

which suggests a problem with your annovar_index.loc file. Check that the database file does indeed exist, and that all the data files for the genome build are in the same directory (see step #5 above).

Tuesday, 17 March 2015

Custom Google web searches for Galaxy help

Google can be a great tool when searching for help with deploying or developing software. However in the specific case of Galaxy, searching the whole of the web can also throw up a lot of unrelated hits (such as astrophysics, or tablet products, to take two examples). I learned recently though that there are now a few custom Google searches available which can help narrow the results.

The full set of searches can be accessed via the Galaxy project website, with the most useful for me being the "Galaxy Admin & Development" subsearch at http://galaxyproject.org/search/getgalaxy/

I haven't used them extensively yet but the couple of searches I've tried have been encouragingly relevant. So hopefully these will help me avoid stellar aggregates and mobile phones in future.

Thursday, 12 February 2015

FTP upload to Galaxy using ProFTPd and PBKDF2

Recently I've been looking at enabling FTP upload for a local instance of Galaxy (based on the latest_2014.08.11 version of galaxy-dist.)

The Galaxy documentation for integrating an FTP upload server can be found at https://wiki.galaxyproject.org/Admin/Config/UploadviaFTP, and I've been working with their recommended choice of ProFTPd. Overall the instructions are pretty straightforward, but I've encountered a few issues, mostly to do with Galaxy's change from SHA1 to PBKDF2 as its default choice of password authentication. This post details how I handled these to get the upload working.

Note that I'm assuming that Galaxy is using Postgres as its database engine.

1. Get ProFTPd

ProFTPd is simple to install on Scientific Linux 6 via yum:
  • yum install proftpd proftpd-postgresql
with proftpd-postgresql providing the extensions required by ProFTPd to access PostgreSQL databases.

If you need to build ProFTPd manually from source (for example because the default version lacks features that you need, such as handling PBKDF2 password encryption - see below) then download the code from the ProFTPd website and do e.g.:

# yum install postgresql-devel openssl-devel
# tar zvxf proftpd-1.3.5.tar.gz
# cd proftpd-1.3.5
# ./configure --prefix=/opt/apps/proftpd/1.3.5 --disable-auth-file --disable-ncurses --disable-ident --disable-shadow --enable-openssl --with-modules=mod_sql:mod_sql_postgres:mod_sql_passwd
# make ; make install

Note that the final step must be performed with superuser privileges.

2. Check how your Galaxy installation handles password encryption

Galaxy appears to support two types of password encryption: older versions of Galaxy use SHA1 to encrypt their passwords, whereas newer versions use a more sophisticated protocol called PBKDF2.

If you're using SHA1 then configuring ProFTPd is pretty straightforward, and the instructions on the Galaxy wiki should work out of the box. If you're using PBKDF2 then the configuration is a little more involved.

You can configure Galaxy to explicitly revert to SHA1 by setting use_pbkdf2 = False in the configuration files, in the [app:main] section; however by default PBKDF2 is used, and this blog post assumes that this is the case.
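For reference, the relevant setting looks like this (shown here with the non-default SHA1 value; leave it at the default to keep PBKDF2):

```ini
[app:main]
; Set to False to revert to the older SHA1 password scheme;
; the default (True) uses PBKDF2, as assumed in the rest of this post
use_pbkdf2 = False
```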

3. (Optionally) Set up a database user specifically for FTP authentication

This is not critical but is recommended. When users try to upload files to the FTP server they will log in using their Galaxy username and password. In order to enable this ProFTPd needs to be able to query Galaxy's database to check these credentials, and doing it via a database user with limited privileges (essentially only SELECT on the galaxy_user table) is more secure than via the one that Galaxy itself uses.

For PostgreSQL the instructions given on the Galaxy wiki are fine.
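In essence this boils down to something like the following, run in psql as the postgres superuser (the galaxyftp user name and the password are placeholders to change for your site):

```sql
CREATE USER galaxyftp PASSWORD 'dbpassword';
-- ProFTPd only needs to read user emails and password hashes
GRANT SELECT ON galaxy_user TO galaxyftp;
```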

4. Create an area where ProFTPd will put uploaded files (and point Galaxy to it)

This should be a directory on the system which is readable by the Galaxy user. The ftp_upload_dir parameter in the Galaxy config file should be set to point to this location.

(It appears that you also need to set a value for ftp_upload_site in order for the uploaded files to be presented to the user when they go to "Upload Files".)
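For example, in the [app:main] section of the Galaxy config file (the path and hostname here are placeholders):

```
ftp_upload_dir = /srv/galaxy/database/ftp
ftp_upload_site = ftp://galaxy.example.org
```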

5. Configure ProFTPd

ProFTPd's default configuration file is located in /etc/proftpd.conf (if using the default system installation), or otherwise in the etc subdirectory where you installed ProFTPd if you built your own.

5.1 Configuring ProFTPd to use SHA1 password authentication

The Galaxy documentation gives an example ProFTPd config file that should work for the old SHA1 password encryption. I don't cover using SHA1 any further in this post.

5.2 Configuring ProFTPd to use PBKDF2 password authentication

As this is not documented on the Galaxy wiki, I used a sample ProFTPd configuration posted by Ricardo Perez in this thread from the Galaxy Developers mailing list as a starting point: http://dev.list.galaxyproject.org/ProFTPD-integration-with-Galaxy-td4660295.html. His example was invaluable for getting this working.

Here's a version of the ProFTPd conf file that I created to enable PBKDF2 authentication:
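The key directives look something like the sketch below, assembled from Ricardo's example and the Galaxy documentation rather than copied verbatim from a working server; the server name, database connection details, UID/GID of 512 and the /srv/galaxy/database/ftp upload path are all placeholders to adjust for your site:

```
ServerName              "Galaxy FTP"
ServerType              standalone
DefaultServer           on
Port                    21
Umask                   077
# Run as the same user/group that runs Galaxy (assumed here to be 'galaxy')
User                    galaxy
Group                   galaxy
UseFtpUsers             off
RequireValidShell       off
# Restrict each user to their own upload directory
DefaultRoot             ~
# Create the per-user upload directory on first login
CreateHome              on dirmode 700

# Authenticate against the Galaxy PostgreSQL database, using the
# read-only database user from step 3
AuthOrder               mod_sql.c
SQLEngine               on
SQLBackend              postgres
SQLAuthenticate         users
SQLConnectInfo          galaxy@localhost:5432 galaxyftp dbpassword

# Galaxy stores passwords as PBKDF2$sha256$10000$<salt>$<base64 hash>
SQLPasswordEngine       on
SQLAuthTypes            PBKDF2
SQLPasswordPBKDF2       SHA256 10000 24
SQLPasswordEncoding     base64
# Extract the salt (field 4 of the stored value, splitting on '$')
SQLPasswordUserSalt     sql:/GetUserSalt
SQLNamedQuery           GetUserSalt SELECT "(string_to_array(encode(password::bytea, 'escape'), '$'))[4] FROM galaxy_user WHERE email='%U'"

# Look up users by Galaxy email: returns email, hash (field 5), UID, GID,
# home directory and shell - 512 is assumed to be the galaxy user's UID/GID
SQLUserInfo             custom:/LookupGalaxyUser
SQLNamedQuery           LookupGalaxyUser SELECT "email, (string_to_array(encode(password::bytea, 'escape'), '$'))[5], '512', '512', '/srv/galaxy/database/ftp/%U', '/bin/sh' FROM galaxy_user WHERE email='%U'"
```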


Note that the SQLPasswordPBKDF2 directive is not available in ProFTPd before version 1.3.5rc3, so check which version you're using.

(It should also be possible to configure ProFTPd to use both SHA1 and PBKDF2 authentication, and there are hints on how to do this in Ricardo's message linked above. However I haven't tried implementing it yet.)

6. Test your ProFTPd settings

ProFTPd can be run as a system service but during initial setup and debugging I found it useful to run directly from a console. In particular:
  • proftpd --config /path/to/conf_file -t performs basic checks on the conf file and warns if there are any syntax errors or other problems
  • proftpd --config /path/to/conf_file -n starts the server in "no daemon" mode
  • proftpd --config /path/to/conf_file -n -d 10 runs in debugging mode with maximal output, which is useful for diagnosing problems with the authentication.
On our system I also needed to update the Shorewall firewall settings to allow external access to port 21 (the default port used by FTP services), by editing the /etc/shorewall/rules file.

You can then test by FTPing to the server and checking that you can log in using your Galaxy credentials, upload a file, see that it appears in the correct place on the filesystem with the correct ownership and permissions (it should be readable and writeable by the user running the Galaxy process), and check that Galaxy's upload tool presents it as an option.
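As a quick check from another machine, curl can drive an FTP upload (the server name and credentials are placeholders; using the --user option avoids problems with the @ in Galaxy email usernames appearing in the URL):

```
$ curl -T test.txt --user 'user@example.org:password' ftp://galaxy.example.org/
```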

If any of these steps fail then running ProFTPd with the debugging option can be really helpful in understanding what's happening behind the scenes.

One other gotcha is that if the Galaxy user UID or GID is less than 999, then you will need to set SQLMinID (or similar) in the ProFTPd conf file to a suitable value, otherwise the uploaded files will not be assigned to the correct user (you can get the UID/GID using the "id" command).
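For example, if the galaxy user's UID and GID are both 501, adding something like this to the conf file should do it (mod_sql also provides separate SQLMinUserUID and SQLMinUserGID directives):

```
SQLMinID        500
```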

7. Make ProFTPd run as a service

If everything appears to be working then you can set up ProFTPd to run as a system service. If you're using the system-installed version then there should already be an /etc/init.d/proftpd file to allow you to do

service proftpd start

Otherwise you will need to make your own init.d script for ProFTPd. I used the one in the documentation at http://www.proftpd.org/docs/howto/Stopping.html as a starting point, put it into /etc/init.d/, and edited the FTPD_BIN and FTPD_CONF variables to point to the appropriate files for my installation.
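With the script in place (proftpd.init.d here stands for your edited copy of it), it can be registered in the usual way on Scientific Linux 6:

```
# cp proftpd.init.d /etc/init.d/proftpd
# chmod 755 /etc/init.d/proftpd
# chkconfig --add proftpd
# chkconfig proftpd on
```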

Once this is done you should have FTP uploads working with Galaxy using PBKDF2 password authentication.

Updates: fixed typos in name of "PBKDF2" and clarify that SHA1 is not used (27/02/2015).