Thursday, 13 February 2020

Setting _JAVA_OPTIONS for Trinity in Galaxy configuration

We recently encountered an issue with the Trinity tool running on the compute cluster back-end of our local production Galaxy instance: specifically, a cluster admin noticed that Trinity jobs were creating more processes than had been allocated when the jobs were submitted, overloading the nodes they'd been dispatched to.

Our Galaxy instance is configured to send Trinity jobs to a special destination defined in the job_conf.xml file:

        ...
        <destination id="jse_drop_trinity" runner="jse_drop">
            <param id="qsub_options">-V -j n -l mem256 -pe smp.pe 12</param>
            <param id="galaxy_slots">12</param>
            <env id="GALAXY_MEMORY_MB">194560</env>
        </destination>
        ...
        <tool id="trinity" destination="jse_drop_trinity" />
        ...

The qsub_options are options for our Grid Engine-based submission system, which dispatches Trinity to a 12-core parallel environment on one of the higher-memory nodes on the cluster; the galaxy_slots option tells the job that 12 slots are available, and is passed to Trinity on start-up so that it knows how many processes it can start.

These options appeared to be working correctly, so the question was then: where were the extra processes coming from? The admin identified that Trinity is actually a Java-based software package, and that the Java runtime appeared to be starting multiple additional processes for its garbage collection (a process within the Java runtime that manages memory usage and other internal book-keeping operations).

Looking at the output from a Trinity job showed the default command line:

Thursday, February 13, 2020: 10:09:18   CMD: java -Xmx64m -XX:ParallelGCThreads=2  -jar /mnt/rvmi/centaurus/galaxy/production/tool_dependencies/_conda/envs/__trinity@2.8.4/opt/trinity-2.8.4/util/support_scripts/ExitTester.jar 0

which includes -XX:ParallelGCThreads=2 and indicates that each Java process should use 2 threads for garbage collection (GC).

It's possible to override the defaults by setting the desired option in the _JAVA_OPTIONS environment variable when the job is run, and this can be done by adding a new element in the job destination for Trinity:

        <env id="_JAVA_OPTIONS">-XX:ParallelGCThreads=1</env>
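
In context, the destination from the job_conf.xml fragment above then looks something like this (a sketch reusing the values shown earlier):

        <destination id="jse_drop_trinity" runner="jse_drop">
            <param id="qsub_options">-V -j n -l mem256 -pe smp.pe 12</param>
            <param id="galaxy_slots">12</param>
            <env id="GALAXY_MEMORY_MB">194560</env>
            <env id="_JAVA_OPTIONS">-XX:ParallelGCThreads=1</env>
        </destination>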

(See the section on Environment modifications in the Galaxy documentation for more details.)

With this in place subsequent Trinity jobs behaved correctly when submitted to the compute cluster.


Wednesday, 29 January 2020

Fixing "internal server error (500)" file upload failures

Users of one of our local Galaxy servers recently reported a problem with uploading files larger than a few tens of megabytes via the "Upload" interface, where the uploader would stop partway through with the message Warning: Internal server error (500).

By trial and error it was established that the maximum file size that the uploader could handle without this failure was around 65MB.

The particular server instance was running Galaxy release 19.05 and was configured to use nginx as the proxy; unfortunately there didn't appear to be any relevant error messages in the logs from either Galaxy or nginx.

However, looking at the sizes of the various logical volumes on the virtual machine hosting the Galaxy instance revealed a potential culprit:

# df -h /var
Filesystem          Size  Used Avail Use% Mounted on
/dev/mapper/lv-var  2.0G  1.8G   65M  97% /var

i.e. the available space on the /var logical volume was around the same size as the maximum size for a successful file upload. Additionally, monitoring the available space under /var during an upload showed it shrinking (and then resetting as the upload either completed or failed). So it appeared that this area was being used by nginx as temporary space for the file uploads before handing the data off to Galaxy.
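
One simple way to keep an eye on this during an upload is with standard command line tools, e.g.:

watch -n 5 df -h /var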

The nginx configuration for this server didn't explicitly set this location, but the nginx documentation includes a directive called client_body_temp_path, which defines the directory for storing temporary files holding client request bodies.

Explicitly setting this directive (in the server block of the nginx configuration) to point to a location on the virtual machine (in this case under /tmp) with more available space seemed to fix the problem:

server {
    ...
    client_body_temp_path /tmp/nginx;
    ...
}
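
Depending on how nginx is started it may not create this directory itself, so it's worth making sure it exists and is writable by the worker processes - a sketch, assuming the workers run as the nginx user:

mkdir -p /tmp/nginx
chown nginx:nginx /tmp/nginx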

Wednesday, 29 May 2019

Fixing dataset download problems for uWSGI+nginx Galaxy configuration

We recently experienced problems downloading datasets via a web browser from one of our local Galaxy instances, which runs release 18.09 and uses a uWSGI+nginx configuration.

While small files (e.g. of the order of Mb) downloaded without problems, larger files (e.g. of the order of Gb) would fail, with a dialog box appearing in the user's web browser complaining that "the source file can't be read". (The Galaxy logs also reported an IOError from the uwsgi_response_write_body_do() function.)

The initial problem seemed to be with the temporary directory being used for managing the download on the server. Explicitly setting uwsgi_temp_path in the nginx configuration seemed to help, for example:

uwsgi_temp_path /tmp/uwsgi;

This got rid of the dialog box, but the larger downloads still failed without completing. Although the user's browser didn't give any more information, the Galaxy logs now reported a timeout error. To address this we explicitly set the uWSGI timeout limits in the nginx configuration, e.g.:

uwsgi_read_timeout 600s;
uwsgi_write_timeout 600s;

The choice of 600s (10 minutes) was arbitrary but seemed long enough to allow the downloads to complete.

Finally, as the temporary area on the server is quite small, we also explicitly set the maximum size of temporary files to 1Mb:

uwsgi_max_temp_file_size 1024k;

Together these addressed the download problem in our local instance.

Thursday, 30 August 2018

Mercurial-based tool installation issues in Galaxy 18.05

Recently I've encountered a subtle problem with tool installation after upgrading our local production instances to Galaxy release 18.05, which I'd like to document here in case it comes up again in future.

The problem manifests itself when attempting to install a tool from the main toolshed via the admin interface: after clicking Install, the tool installation status goes almost immediately to Error. Further inspection reveals that the tool repository hasn't been cloned to the local filesystem, and no dependencies are installed.

Frustratingly, this failure doesn't leave any error messages in the logs which might help to diagnose the cause. However, attempting the install via the Galaxy API (using nebulizer) did return an error message:

Error cloning repository: [Errno 2] No such file or directory

I was able to track this down to the clone_repository function in lib/tool_shed/util/hg_util.py, where it is issued when something goes wrong with the hg clone ... command used in the tool installation process. hg is the Mercurial version control command, and essentially the problem was that this command couldn't be found by Galaxy.

Our local Galaxy installations are configured to use supervisor with uWSGI, with the Galaxy dependencies installed into a Python virtualenv. Since this virtualenv included Mercurial, I wondered why hg wasn't being picked up from there for the tool installation process.
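
A quick way to check what's actually available (paths and user name here are assumptions, matching the virtualenv location used in the supervisor example below):

# is hg present (and executable) in the Galaxy virtualenv?
ls -l /srv/galaxy/venv/bin/hg
# what does the Galaxy user actually see on its PATH?
sudo -u galaxy -H bash -c 'which hg && hg --version'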

Marius van den Beek offered some helpful insights via the galaxy-dev mailing list which clarified the situation:
Recent galaxy releases are using the `hg` command that should be automatically installed along with other galaxy dependencies.
If you're running galaxy in a virtualenv then that virtualenv should have the `hg` script in the bin folder.
Depending on how you start galaxy you may need to add the virtualenv's `bin` folder to the `PATH`.
Based on this it turned out that I needed to add an 'environment' parameter to the supervisor.ini file for Galaxy, to specify the virtualenv to use and add its bin directory to the PATH - something like:

environment = VIRTUAL_ENV="/srv/galaxy/venv",PATH="/srv/galaxy/venv/bin:%(ENV_PATH)s"

(This parameter is mentioned in the installation documentation, in the Scaling and Load Balancing section, but only for configuring handler processes. However since our instances are using the uwsgi + mules strategy, it didn't occur to me that it would still be needed.)

Restarting Galaxy with the updated supervisor.ini file enabled tool installation to work without problems again.
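
For reference, a sketch of one way to apply the change via supervisor's control client (the program or group name for Galaxy depends on your supervisor.ini):

supervisorctl reread    # pick up the edited supervisor.ini
supervisorctl update    # restart any programs whose configuration changed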

Some closing asides:

  • The problem can be masked if Mercurial is installed elsewhere on the system and is on the Galaxy user's PATH (for example /usr/bin/hg)
  • If there is a system version but it is very old (for example Scientific Linux 7 has Mercurial 1.7) then it can cause a slightly different error in clone_repository, but the outcome and fix should be the same as above
  • Since first encountering this issue I've come across a strange variant, whereby Mercurial is installed in the Galaxy virtualenv and supervisor is correctly configured but the tool installations still fail immediately. In this case, for some unknown reason, it turned out that the hg script in the virtualenv wasn't executable - adding 'execute' permission (see below) fixed this one.
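
For that last case the fix was a one-liner (path as in the supervisor example above):

chmod +x /srv/galaxy/venv/bin/hg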

Tuesday, 25 April 2017

Securing Galaxy with HTTPS running with Nginx using Let’s Encrypt

Background

To secure communication between a Galaxy instance and its users it is best to enable HTTPS on the Galaxy web server, to ensure that all data transmissions between Galaxy and the end user (including sensitive information such as usernames and passwords) are encrypted. This can be done by obtaining and installing SSL/TLS certificates on the server.

The simplest approach in the past was to use self-signed certificates as a way to enable HTTPS while avoiding the cost of purchasing certificates from a commercial Certificate Authority (CA) (for example by using the make-dummy-certs utility found in e.g. /usr/ssl/certs). The downside of this approach is that when a user first tries to access the server their web browser will complain that the certificates are not trusted, and they typically have to create a one-off security exception before they can access the Galaxy service.

More recently however, a free Certificate Authority called Let's Encrypt (https://letsencrypt.org/) has been set up, which issues free certificates as part of its stated mission to "secure the web". This blog post gives an overview of how we obtained and installed certificates from Let's Encrypt to enable HTTPS for our production Galaxy instances, using their automated cert-bot client utility.

Before beginning

The procedure described below uses the 'webroot' plugin of cert-bot (see https://certbot.eff.org/docs/using.html#webroot), which is a general method recommended for obtaining certificates for web servers running nginx. cert-bot also has a plugin for nginx, but at the time of writing this is still at alpha-release stage so I didn't use it for our Galaxy servers (see https://certbot.eff.org/docs/using.html#nginx for more details).

For Apache-based servers you can use a dedicated plugin described at https://certbot.eff.org/docs/using.html#apache, which offers a more automated procedure than the one described here.

Also, although it targets a different operating system to ours and while many of the details are now out-of-date, DigitalOcean's how-to guide at https://www.digitalocean.com/community/tutorials/how-to-secure-nginx-with-let-s-encrypt-on-ubuntu-14-04 is still a useful resource and was immensely helpful to me for understanding the overall process.

Finally, please note that the procedure and its details are likely to change over time. Make sure you check the documentation before carrying out any of these operations on your own infrastructure!

Step 1: Install cert-bot (Let’s Encrypt client) on the server

To begin you need to ensure the Let's Encrypt cert-bot utility (https://certbot.eff.org/) is available on the server, to perform the job of obtaining and installing the certificates.

The documentation recommends that if possible you should use the cert-bot package provided by the package manager for your system (e.g. yum, apt etc). However if one isn't available (or is unsuitable e.g. because it's out-of-date) then you can install the client using the certbot-auto wrapper script instead (see https://certbot.eff.org/docs/install.html#certbot-auto). This is the approach I used, putting certbot-auto into /usr/local/bin on the server running Galaxy and nginx.

(Note that certbot-auto takes the same arguments as the cert-bot utility, the only difference is that if necessary it will download and update itself first each time it's run.)

  • Aside: there is also a cert-bot package available via the Python Package Index (PyPI). When I first performed this procedure I noted that the documentation emphasised that cert-bot should not be installed via 'pip install', but now I can't find any reference to this. However I would still avoid installing from PyPI for the time being.

Step 2: Get certificates using the 'webroot' method

cert-bot provides a number of different ways to obtain certificates depending on the webserver software being used. The 'webroot' protocol used here is less automated than some of the other procedures but is still quite straightforward, and works by placing a special file on your webserver which Let's Encrypt can attempt to fetch in order to verify the server name and details that are supplied when the cert-bot client is run.

First we need to set up a special directory called .well-known, where Let's Encrypt will place its file:
  • Create a directory called .well-known in the document root of the server (the default for nginx is /usr/share/nginx/html, but the actual path can be found by checking the document root - the root directive - in the server configuration), e.g.:

    mkdir /usr/share/nginx/html/.well-known

    Optionally also add a dummy index file to help check that the directory is visible via web browser later, e.g.:

    cat >/usr/share/nginx/html/.well-known/index.html <<EOF
    Hello world!
    EOF
  • Add a new location block inside the server block in the nginx configuration file, to allow access to the .well-known directory:

    location ~ /.well-known {
        allow all;
    }
  • Restart nginx and check that the .well-known directory is visible (e.g. by pointing a web browser at it, or with curl as sketched below)
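
For example, assuming the dummy index file from above was created (substitute your own domain for MYDOMAIN):

curl -i http://MYDOMAIN/.well-known/index.html
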
Then we need to run certbot-auto (or cert-bot) interactively to generate and install the certificates:
  • sudo certbot-auto certonly --webroot -w /usr/share/nginx/html -d MYDOMAIN
where MYDOMAIN is the domain name of your Galaxy server (e.g. "palfinder.ls.manchester.ac.uk").

  • Aside: note that this bootstraps certbot, including checking for the system packages that it requires; you'll be prompted to install any that it thinks are missing via the system package manager e.g. yum.

certbot will then prompt you to agree to Let's Encrypt's terms and conditions and ask you to provide an email address which will be used for notices and for lost key recovery.

If all goes well then this should produce a set of certificate files under /etc/letsencrypt/archive (with links to these from /etc/letsencrypt/live/):
  • cert.pem (your domain's certificate)
  • chain.pem (the Let's Encrypt chain certificate)
  • fullchain.pem (cert.pem and chain.pem combined)
  • privkey.pem (your certificate's private key)
  • IMPORTANT: you should ensure that the certificate files are backed up to a secure and safe location!
Step 3: configure TLS/SSL on nginx using the certificates

To enable HTTPS we need to configure nginx to listen on port 443 with SSL enabled, and to use the certificates from Let's Encrypt. This is done by adding the following to the server block in the nginx configuration file, for example:

server {
    listen 443;
    ssl on;
    ssl_certificate /etc/letsencrypt/live/MYDOMAIN/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/MYDOMAIN/privkey.pem;
}

(Again, your actual domain name should be substituted for MYDOMAIN above.)

It's also a good idea to block or redirect HTTP traffic, so that users don't accidentally send data via an insecure connection - for example, to redirect all HTTP requests to HTTPS:

server {
    listen 80 default;
    server_name MYDOMAIN;
    rewrite ^ https://$server_name$request_uri? permanent;
}
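
Before restarting it's worth checking that the updated configuration parses cleanly, e.g.:

nginx -t
service nginx restart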

Once nginx is restarted you can check using your browser that HTTPS is working for your Galaxy instance; you can also use the Qualys SSL Labs website to check your server configuration:

https://www.ssllabs.com/ssltest/index.html

(NB this is useful for flagging up other issues which you might wish to address!)

Step 4: set up automated certificate renewal

Finally: since all certificates issued by Let's Encrypt expire after 90 days, Let's Encrypt recommends renewing them at least once every 3 months.

It's straightforward to automate this process by setting up a cron job on the server to run cert-bot or certbot-auto's 'renew' command (which will renew any previously-obtained certificates that are due to expire in less than 30 days) and then restart nginx (so that any renewed certificates will be loaded).

For example I have the following commands in the root crontab on our server:

# Check SSL certificate renewal from Let's Encrypt
30 2 * * 1 /usr/local/bin/certbot-auto renew >> /var/log/le-renew
# Restart nginx after SSL certificate renewal
35 2 * * 1 service nginx restart >> /var/log/le-renew

See the documentation at https://certbot.eff.org/docs/using.html#renewing-certificates for more information on certificate renewal.
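
It's also possible to test the renewal process without touching the live certificates, using the renew command's --dry-run option (which uses Let's Encrypt's staging environment), e.g.:

/usr/local/bin/certbot-auto renew --dry-run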

Update 22nd October 2018: the original crontab lines above didn't work for me - the certificate renewals would fail and have to be performed manually, resulting in downtime for the period when nginx would no longer have valid SSL certificates.

Since then I've replaced the original crontab lines with the following single line:

30 2 * * 1 /usr/local/bin/certbot-auto renew --deploy-hook "/sbin/service nginx reload" >> /var/log/le-renew


which uses certbot-auto's --deploy-hook option to reload the nginx configuration on successful certificate renewal via the service command. Note that the full path to service is required as cron jobs have a minimal PATH which doesn't seem to include /sbin.

Tuesday, 2 June 2015

Exposing Galaxy reports via nginx in a production instance

Galaxy includes a report tool that is separate from the main process but which gives lots of potentially useful information about the usage of a Galaxy instance, for example the numbers of jobs that have been run each month, how much disk space each user is currently consuming and so on.

However there doesn't appear to be much documentation about the report tool on the official Galaxy wiki: the most I could find was a rather sparse page at https://wiki.galaxyproject.org/Admin/UsageReports, which gives a very bare-bones overview and doesn't include any information on how it might be exposed in a secure manner in a production environment. Therefore in this post I outline how I've done this for our local Galaxy setup, which uses nginx; however I imagine it could be adapted to work with Apache instead.

1. Set up the report tool to run on localhost

The report tool takes its configuration settings from a file called reports_wsgi.ini, which is located in the config subdirectory of the Galaxy distribution.

Configuring the reports for your local setup is a case of:
  • Making a copy of reports_wsgi.ini.sample called reports_wsgi.ini
  • Editing the database_connection and file_path (if not the default) parameters to match those in your galaxy.ini (or universe_wsgi.ini) file
  • Optionally, editing the port parameter (by default the tool uses port 9001)
  • You should also set the session_secret parameter (a 'salt' value) if you intend to expose the reports via the web proxy (see below)
Then you can start the report server using

sh run_reports.sh

and view the reports by pointing a web browser running on the same server to http://127.0.0.1:9001.

If you'd like the report tool to persist between sessions then use

sh run_reports.sh --daemon

to run it as a background process. As with Galaxy itself, use --stop-daemon to halt the background process. (The log file is written to reports_webapp.log if you need to try and debug a problem.)

2. Expose the report tool via nginx

If you're running a production Galaxy and want to be able to access the reports from a browser running on a different machine to your Galaxy server then you could consider using SSH tunnelling, which essentially forwards a port on your local machine to one on the server, i.e. port 9001 where the report tool is serving from (see "SSH Tunneling Made Easy" at http://www.revsys.com/writings/quicktips/ssh-tunnel.html for more details of how to do this).
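
A minimal sketch of such a tunnel (substitute your own user name and the real hostname of the Galaxy server):

ssh -L 9001:127.0.0.1:9001 USER@galaxy.example.org

With the tunnel open, pointing a browser on your local machine at http://127.0.0.1:9001 should then show the reports.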

Alternatively if you are using a web proxy (as is standard for a production setup) then you could try serving the reports also via the proxy (in this case nginx). In this example I assume that if Galaxy is being served from e.g. http://galaxy.example.org/ then the reports will be viewed via http://galaxy.example.org/reports/.

First, make the appropriate edits to reports_wsgi.ini: if you have an older Galaxy instance then you'll need to add some sections to the file, specifically:

[filter:proxy-prefix]
use = egg:PasteDeploy#prefix
prefix = /reports

(before the [app:main] section), and

filter-with = proxy-prefix
cookie_path = /reports

(within the [app:main] section.)

For more recent Galaxy instances it's simply a case of making sure that the existing filter-with and cookie_path lines are uncommented and set to the values above.

Next it's necessary to add upstream and location sections in your nginx.conf file:
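
A sketch of what these might look like, assuming the report tool is listening on 127.0.0.1:9001 as configured above (the upstream name and proxy headers are illustrative and will vary between setups):

upstream reports_app {
    server 127.0.0.1:9001;
}

server {
    ...
    location /reports/ {
        proxy_pass http://reports_app;
        proxy_set_header X-Forwarded-Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
    ...
}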

(This has many similarities to serving Galaxy itself from a subdirectory via the nginx proxy; see https://wiki.galaxyproject.org/Admin/Config/nginxProxy.)

One important thing to be aware of is that the report tool doesn't include any built-in authentication, so it's recommended that you add some authentication within the web proxy. Otherwise anyone in the world could potentially access the reports for your server and see sensitive information such as user login names.

To do this with nginx, first create a htpasswd file to hold a set of user names and associated passwords, using the htpasswd utility, e.g.:

htpasswd -c /etc/nginx/galaxy-reports.htpasswd admin

-c means create a new file (in this case /etc/nginx/galaxy-reports.htpasswd); admin is the username to add. The program will prompt for a password for that username, and store it in the file. You can use any username, and any filename or location (with the caveat that it must be readable by the nginx process) that you wish.

Finally, to associate the password file with the reports location, update the nginx config file by adding two more lines:
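
With nginx these are the standard basic-auth directives, added inside the reports location block (the realm string is arbitrary), e.g.:

    location /reports/ {
        ...
        auth_basic "Galaxy reports";
        auth_basic_user_file /etc/nginx/galaxy-reports.htpasswd;
    }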

(I found this article very helpful here; note that it also works for https in spite of the title: "How to set up http authentication with nginx on Ubuntu 12.10" https://www.digitalocean.com/community/tutorials/how-to-set-up-http-authentication-with-nginx-on-ubuntu-12-10).

Once nginx has been restarted then anyone attempting to view the reports at http://galaxy.example.org/reports/ will be prompted to enter a username/password combination matching an entry in the htpasswd file before they are given access. Authorised users can then peruse the reports to their heart's content.

Wednesday, 22 April 2015

Using GALAXY_SLOTS with multithreaded Galaxy tools

GALAXY_SLOTS is a useful but not particularly well-publicised way of controlling the number of threads Galaxy allocates to a tool that supports multithreaded operation. It's relevant to both Galaxy admins (who need to ensure that multithreaded jobs don't try to consume more resources than they have access to) and to tool developers (who need to know how many threads are available to a tool at runtime).

Having seen various references to GALAXY_SLOTS on the developer's mailing list I'd assumed this was some esoteric feature that I would need to set up to use, but in actual fact it's almost embarrassingly simple for most cases. Essentially it can be thought of as an internal variable that's set by Galaxy when it starts a job, which indicates the number of threads that are available for that job and which can subsequently be accessed by a tool in order to make use of that number of threads.

The official documentation can be found at https://wiki.galaxyproject.org/Admin/Config/GALAXY_SLOTS and covers the essential details, but the executive summary is:
  • Tool developers should use GALAXY_SLOTS when specifying the number of threads a tool should run with;
  • Galaxy admins shouldn't need to configure anything unless they're using the local runner, or (possibly) a novel cluster submission system.
And really, that's it. However the following sections give a bit more detail for those who like to have it spelled out (like me).

For tool developers

All that is required for tool developers is to specify GALAXY_SLOTS in the <command> tag in the tool XML wrapper, when setting the number of threads the tool uses.

The syntax for specifying the variable is:

\${GALAXY_SLOTS:-N}

where N is the default value to use if GALAXY_SLOTS is not set. (See the "Tool XML File syntax" documentation for the tag at https://wiki.galaxyproject.org/Admin/Tools/ToolConfigSyntax#A.3Ccommand.3E_tag_set for more details - you need to scroll down to the section on "Reserved Variables" to find it.)

For example, here's a code fragment from the XML wrapper for a tool to run the Trimmomatic program:
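
A minimal sketch of the relevant part of the <command> block (the surrounding options are illustrative; see the toolshed repository below for the actual wrapper):

    <command>
        trimmomatic PE -threads \${GALAXY_SLOTS:-6} ...
    </command>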


The number of threads defaults to 6 unless GALAXY_SLOTS is explicitly set.

(Aside: the Trimmomatic tool itself can be obtained from the toolshed at https://toolshed.g2.bx.psu.edu/view/pjbriggs/trimmomatic)

For Galaxy Admins

It turns out that generally there is nothing special to do for most cluster systems, although this is not immediately clear from the documentation: in most cases GALAXY_SLOTS is handled automagically and so doesn't require any explicit configuration.

For example for DRMAA (which is what we're using locally), we have job runners defined in our job_conf.xml file like:
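
A sketch along these lines (the destination id is illustrative; the nativeSpecification parameter carries the Grid Engine submission options):

    <destination id="smp4" runner="drmaa">
        <param id="nativeSpecification">-V -pe smp.pe 4</param>
    </destination>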


In our setup, -pe smp.pe 4 above requests 4 cores for the job. When using this runner, Galaxy will automagically determine the number of cores from DRMAA (i.e. 4) and set GALAXY_SLOTS to the appropriate value - nothing more to do.

The most obvious exception is the "local" job runner, where you need to explicitly set the number of available slots using the <param id="local_slots"> tag in job_conf.xml; see https://wiki.galaxyproject.org/Admin/Config/Performance/Cluster#Local for more details.
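
For example, a local-runner destination might be configured along these lines (the slot count is illustrative):

    <destination id="local4" runner="local">
        <param id="local_slots">4</param>
    </destination>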

Finally, for other job submission systems see the documentation on how to verify that the environment is being set correctly.