Thursday 13 February 2020

Setting _JAVA_OPTIONS for Trinity in Galaxy configuration

We recently encountered an issue with the Trinity tool running on the compute cluster back-end of our local production Galaxy instance: specifically, a cluster admin noticed that Trinity jobs were creating more processes than had been allocated when the jobs were submitted, overloading the nodes they'd been dispatched to.

Our Galaxy instance is configured to send Trinity jobs to a special destination defined in the job_conf.xml file:

        ...
        <destination id="jse_drop_trinity" runner="jse_drop">
            <param id="qsub_options">-V -j n -l mem256 -pe smp.pe 12</param>
            <param id="galaxy_slots">12</param>
            <env id="GALAXY_MEMORY_MB">194560</env>
        </destination>
        ...
        <tool id="trinity" destination="jse_drop_trinity" />
        ...

The qsub_options are options for our Grid Engine-based submission system, which dispatches Trinity to a 12-core parallel environment on one of the higher-memory nodes on the cluster; the galaxy_slots param tells the job that 12 slots are available, and is passed to Trinity at startup so that it knows how many processes it can start.
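
Galaxy exposes the slot count to the job as the GALAXY_SLOTS environment variable, and GALAXY_MEMORY_MB is set via the env element above. Purely as an illustration (this is not the actual Galaxy Trinity wrapper command, and the read filenames are made up), the effect inside the job is along these lines:

# Illustrative sketch only -- not the real wrapper command.
# GALAXY_SLOTS is set from the destination's galaxy_slots param and
# GALAXY_MEMORY_MB from the env element above (194560 MB ~= 190G).
Trinity --seqType fq --left reads_1.fq --right reads_2.fq \
        --CPU "${GALAXY_SLOTS:-4}" \
        --max_memory "$(( ${GALAXY_MEMORY_MB:-4096} / 1024 ))G"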

These options appeared to be working correctly, so the question was then: where were the extra processes coming from? The admin identified that Trinity is actually a Java-based software package, and that the Java runtime appeared to be starting multiple additional processes for its garbage collection (a facility within the Java runtime for managing memory usage and other internal book-keeping operations).

Looking at the output from a Trinity job showed the default command line:

Thursday, February 13, 2020: 10:09:18   CMD: java -Xmx64m -XX:ParallelGCThreads=2  -jar /mnt/rvmi/centaurus/galaxy/production/tool_dependencies/_conda/envs/__trinity@2.8.4/opt/trinity-2.8.4/util/support_scripts/ExitTester.jar 0

which includes -XX:ParallelGCThreads=2 and indicates that each Java process should use 2 threads for garbage collection (GC).
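
(As an aside, the GC thread count a particular JVM will actually use on a node can be checked directly, independently of Trinity, with a flag query:)

# print the effective value of ParallelGCThreads for this JVM on this node
java -XX:+PrintFlagsFinal -version | grep ParallelGCThreads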

It's possible to override the defaults by setting the desired option in the _JAVA_OPTIONS environment variable when the job is run, and this can be done by adding a new env element to the job destination for Trinity:

        <env id="_JAVA_OPTIONS">-XX:ParallelGCThreads=1</env>

(See the section on Environment modifications in the Galaxy documentation for more details.)
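
For completeness, the Trinity destination in job_conf.xml then ends up looking like this (the same values as before, plus the new env element):

        <destination id="jse_drop_trinity" runner="jse_drop">
            <param id="qsub_options">-V -j n -l mem256 -pe smp.pe 12</param>
            <param id="galaxy_slots">12</param>
            <env id="GALAXY_MEMORY_MB">194560</env>
            <env id="_JAVA_OPTIONS">-XX:ParallelGCThreads=1</env>
        </destination>

A useful side effect is that the Java runtime announces any options it picks up this way with a "Picked up _JAVA_OPTIONS: ..." line on stderr, so the job output gives a quick confirmation that the setting has actually reached the Trinity processes.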

With this in place subsequent Trinity jobs behaved correctly when submitted to the compute cluster.


Wednesday 29 January 2020

Fixing "internal server error (500)" file upload failures

Users of one of our local Galaxy servers recently reported a problem with uploading files larger than a few tens of megabytes via the "Upload" interface: the uploader would stop partway through with the message "Warning: Internal server error (500)".

For example:

[Screenshot of the Galaxy upload interface showing the "Internal server error (500)" warning]
By trial and error it was established that the maximum file size that the uploader could handle without this failure was around 65MB.

The particular server instance was running Galaxy release 19.05 and was configured to use nginx as the proxy; unfortunately there didn't appear to be any relevant error messages in the logs from either Galaxy or nginx.
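
(The sort of thing that was checked is shown below; the log paths here are only hypothetical examples, since the real locations depend on how Galaxy and nginx are deployed:)

# neither log showed anything useful around the time of the failed uploads
tail -n 100 /var/log/nginx/error.log
tail -n 100 /srv/galaxy/log/galaxy.log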

However, looking at the sizes of the various logical volumes on the virtual machine hosting the Galaxy instance revealed a potential culprit:

# df -h /var
Filesystem          Size  Used Avail Use% Mounted on
/dev/mapper/lv-var  2.0G  1.8G   65M  97% /var

i.e. the available space on the /var logical volume was around the same size as the maximum size for a successful file upload. Additionally, monitoring the available space under /var during an upload showed it shrinking (and then resetting as the upload either completed or failed). So it appeared that this area was being used by nginx as temporary space for file uploads before handing the data off to Galaxy.
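
Watching this in real time needs nothing more sophisticated than re-running df during an upload, for example:

# refresh the free space on /var every second while an upload is in progress
watch -n 1 df -h /var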

The nginx configuration for this server didn't explicitly set this location, but the nginx documentation includes a directive called client_body_temp_path, which defines the directory for storing temporary files holding client request bodies.

Explicitly setting this directive (in the server block of the nginx configuration) to point to a location on the virtual machine (in this case under /tmp) with more available space seemed to fix the problem:

server {
    ...
    client_body_temp_path /tmp/nginx;
    ...
}
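
After updating the configuration the change still has to be applied; assuming a typical systemd-managed nginx, that's something like the following (and, depending on permissions, the /tmp/nginx directory may need to exist and be writable by the nginx worker user):

# check the updated configuration, then reload nginx
nginx -t && systemctl reload nginx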