Wednesday 22 April 2015

Using GALAXY_SLOTS with multithreaded Galaxy tools

GALAXY_SLOTS is a useful but not particularly well-publicised way of controlling the number of threads Galaxy allocates to a tool that supports multithreaded operation. It's relevant to both Galaxy admins (who need to ensure that multithreaded jobs don't try to consume more resources than they have access to) and to tool developers (who need to know how many threads are available to a tool at runtime).

Having seen various references to GALAXY_SLOTS on the developers' mailing list, I'd assumed it was some esoteric feature that I would need to set up before using it, but in fact it's almost embarrassingly simple in most cases. Essentially it can be thought of as a variable that Galaxy sets when it starts a job: it indicates the number of threads available to that job, and a tool can read it in order to run with that number of threads.

The official documentation can be found here: https://wiki.galaxyproject.org/Admin/Config/GALAXY_SLOTS
and this covers the essential details, but the executive summary is:
  • Tool developers should use GALAXY_SLOTS when specifying the number of threads a tool should run with;
  • Galaxy admins shouldn't need to configure anything unless they're using the local runner, or (possibly) a novel cluster submission system.
And really, that's it. However, the following sections give a bit more detail for those who (like me) prefer to have it spelled out.

For tool developers

All that is required of tool developers is to specify GALAXY_SLOTS in the <command> tag of the tool XML wrapper when setting the number of threads the tool uses.

The syntax for specifying the variable is:

\${GALAXY_SLOTS:-N}

where N is the default value to use if GALAXY_SLOTS is not set. (See the "Tool XML File syntax" documentation for the <command> tag at https://wiki.galaxyproject.org/Admin/Tools/ToolConfigSyntax#A.3Ccommand.3E_tag_set for more details; you need to scroll down to the section on "Reserved Variables" to find it.)
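The ":-" form is ordinary shell parameter expansion; the leading backslash in the wrapper is there so that Galaxy's templating passes the "$" through to the shell untouched. As a minimal sketch of how the expansion behaves (thread_count is a hypothetical helper written for illustration, not part of Galaxy, and the default of 6 is arbitrary):

```shell
#!/bin/sh
# thread_count is a hypothetical helper, not part of Galaxy: it echoes the
# value a tool would see when its command line uses ${GALAXY_SLOTS:-6}.
thread_count() {
    echo "${GALAXY_SLOTS:-6}"
}

# Case 1: GALAXY_SLOTS is not set, so the default after ":-" is used.
unset GALAXY_SLOTS
unset_result=$(thread_count)

# Case 2: the variable has been set (simulated here by a plain assignment,
# standing in for whatever Galaxy does when it starts the job).
GALAXY_SLOTS=4
set_result=$(thread_count)

echo "unset: $unset_result, set: $set_result"
```

In other words, the N in the wrapper is only a fallback: whenever GALAXY_SLOTS has a non-empty value, that value is used instead.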

For example, here's a code fragment from the XML wrapper from a tool to run the Trimmomatic program:


The number of threads defaults to 6 unless GALAXY_SLOTS is explicitly set.

(Aside: the Trimmomatic tool itself can be obtained from the toolshed at https://toolshed.g2.bx.psu.edu/view/pjbriggs/trimmomatic)

For Galaxy Admins

It turns out that generally there is nothing special to do for most cluster systems, although this is not immediately clear from the documentation: in most cases GALAXY_SLOTS is handled automagically and so doesn't require any explicit configuration.

For example, for DRMAA (which is what we're using locally), we have job runners defined in our job_conf.xml file like:


In our setup, the -pe smp.pe 4 above requests 4 cores for the job. When using this runner, Galaxy will automagically determine the number of cores from DRMAA (i.e. 4) and set GALAXY_SLOTS to the appropriate value, so there is nothing more to do.

The most obvious exception is the "local" job runner, where you need to explicitly set the number of available slots using the <param id="local_slots"> tag in job_conf.xml; see https://wiki.galaxyproject.org/Admin/Config/Performance/Cluster#Local for more details.

Finally, for other job submission systems see the documentation on how to verify that the environment is being set correctly.
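One simple way to check, offered here as an assumption rather than as the documented procedure, is to run a small script as a job on the destination in question and look at what it prints. In this sketch report_slots is a hypothetical name:

```shell
#!/bin/sh
# report_slots is a hypothetical helper: it reports whether GALAXY_SLOTS
# is present (and non-empty) in the job environment, and its value if so.
report_slots() {
    if [ -n "${GALAXY_SLOTS:-}" ]; then
        echo "GALAXY_SLOTS=${GALAXY_SLOTS}"
    else
        echo "GALAXY_SLOTS is not set or is empty"
    fi
}

report_slots
```

If the output shows the variable as missing, tools on that destination will fall back to whatever default their wrapper specifies.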

2 comments:

  1. Thanks for explaining GALAXY_SLOTS. Suppose I have defined 4 threads in the job_conf.xml file while it is set to 1 (GALAXY_SLOTS:-1) in the tool wrapper. Which one would it pick?

    1. The definition in the tool wrapper sets the default number of threads in the absence of any additional configuration; however, I believe that this is overridden by the definition in the job_conf.xml file.

      So in your example, GALAXY_SLOTS would be set to 4 threads when the tool was run.

      HTH
