Penn Bioinformatics Core

Liniac Compute Cluster

The Genomics Institute and Bioinformatics Core support a large parallel compute cluster (currently 256 processors) for large, compute-intensive jobs. For more information or to report a problem, send mail to manager@genomics.upenn.edu.

Genomics Applications

A number of job managers have been written that allow users to specify scripts to run on the nodes and manage jobs (setting up the nodes and sending and retrieving data to/from the nodes). Currently managers have been written for running BLAST, RepeatMasker, ArrayOligoSelector (for designing oligos for array hybridizations) and GeneHunter-TwoLocus (GHT). Email bioinfocore@pcbi.upenn.edu if you need help getting started running these applications.

DistribJob: A generic job manager that lets users write modules and scripts that it uses to interact with the nodes. Modules are currently available for BLAST, RepeatMasker and a variety of specific applications. This is the recommended job manager because it is extensible and better documented, and its error recovery is fairly robust. For help, type "/genomics/share/bin/distribjob" at the liniac prompt, or "/genomics/share/bin/distribjob -help" for more complete help.

Additional DistribJob help follows:

Setting up your environment to run DistribJob

You must set your GUS_HOME environment variable. If you are using csh or tcsh, add the following to your .login:

setenv GUS_HOME /genomics/share/pkg/gus/gushome
setenv PATH $GUS_HOME/bin:$PATH

If you are using bash, add the following to your .bash_profile:

GUS_HOME=/genomics/share/pkg/gus/gushome; export GUS_HOME
PATH=$GUS_HOME/bin:$PATH; export PATH

Then log out and log in again.

Specifying your task

You provide a task to DistribJob (see configuration below).  The task determines:

  • The allowable type of input
  • How the input is subdivided into subtasks
  • How the server is to be initialized before the task starts
  • How each node is to be initialized before the task starts
  • What command to run on the node for each subtask
  • How to merge the subtask results into the main result

Choose from the built-in tasks provided by the DistribJobTasks module, or code your own task.

Constraints on input

Following are constraints that apply to your task's input:

  • The input must be conceptually an array.  The controller will provide the task with the starting and ending indices of a subtask, and the task must provide a file or files that represent the associated elements for that range of the original input.
  • The order in which the elements are processed must not matter.
  • You may not add or delete input elements when restarting a job.
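
To make the array constraint concrete, here is a minimal sketch of how an input of N elements could be cut into index ranges, one range per subtask.  The function name subtask_ranges and the zero-based, end-exclusive convention are assumptions made for illustration; this page does not state the indexing convention the controller actually uses.

```shell
# Hypothetical helper (not part of DistribJob): print one "start end" pair
# per subtask for an input of TOTAL elements, at most SIZE elements each.
# Indices are zero-based and END is exclusive; this convention is assumed.
subtask_ranges() {
    total=$1
    size=$2
    if [ "$size" -lt 1 ]; then
        echo "subtask size must be at least 1" >&2
        return 1
    fi
    start=0
    while [ "$start" -lt "$total" ]; do
        end=$((start + size))
        if [ "$end" -gt "$total" ]; then
            end=$total
        fi
        echo "$start $end"
        start=$end
    done
}
```

For example, subtask_ranges 10 4 prints three ranges: 0 4, 4 8 and 8 10.  Because the ranges are computed from element positions, adding or deleting an element between runs would shift every later range, which is why the input must stay fixed across restarts.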

Setting up to run

Examples showing how to run DistribJob for the three existing genomics tasks (BlastMatrix, BlastSimilarity and RepeatMasker) can be found in $GUS_HOME/test/DJob/DistribJobTasks.

The first step is to decide where you want your input directory, and where you want your master directory.  Typically, these will go in a directory dedicated to this run of your task.  Just make sure there is enough storage space to accommodate your results.

As an example, let's say you choose to run in $HOME/myrun, and that your task is creating a BLAST matrix (one of the provided bioinformatics tasks).  Create your input directory, and copy the input file to it (in this case, a set of DNA sequences).

% mkdir -p $HOME/myrun/input

% cp myseqs.fsa $HOME/myrun/input
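
Each sequence is one element of the input, so it can help to know how many sequences the file holds before choosing a node count.  The following is a minimal sketch, assuming myseqs.fsa is in FASTA format (one header line beginning with ">" per sequence), which this page does not state explicitly.  count_fasta_records is a hypothetical helper name, not a DistribJob command.

```shell
# Hypothetical helper: count the records in a FASTA file by counting header
# lines (lines beginning with ">"). Assumes the input is FASTA formatted.
count_fasta_records() {
    # grep -c exits with status 1 when the count is 0; treat that as success.
    grep -c '^>' "$1" || [ $? -eq 1 ]
}
```

In a Bourne-style shell, count_fasta_records $HOME/myrun/input/myseqs.fsa would print the number of sequences in the example input.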

Configuring

Create two configuration files for your task (the standard place for them is your input directory). Example files referred to in this section are in /genomics/share/controllers/DistribJobTasks/*Input/

  • controller.prop contains the properties required by the controller.  To see sample values for this file, look in controller_*.prop.  The properties are explained in detail in the full help for the distribjob command (distribjob -help).  (You can name this file whatever you like, but controller.prop is the convention.)
  • task.prop contains the properties required by your task.  If you are using one of the provided bioinformatics tasks, find sample task.prop files in task.prop.  In addition, the properties are defined in Perl files that subclass Task.pm.  Look in DistribJobTasks/lib/perl.

Other resources

You may need to make other resources available to your task.  For example, the provided BlastMatrixTask and BlastSimilarityTask make use of a database of sequences to BLAST against.  This file may be very large, so you will not want to copy it into your input directory.  Instead, specify the resource location using the appropriate property in your task.prop file.

Running

The distribjob command starts the controller.  Running it with no arguments prints its usage.  Running it with -help prints its full help display.

You use different commands to run locally or to run on different clusters.

Regardless of how you run, the result of all the subtasks will be merged into master/mainresult.

Running locally

If you use DistribJob::LocalNode as your node type, you are running locally (i.e., on your local server).  You do not need to submit your job to a cluster queue; you just run distribjob directly.  To do so, use this command (if, for example, you want to distribute across 3 virtual nodes):

% distribjob your_controller.prop 1 2 3

Running on UPenn's Liniac cluster

You are assumed to know how to use UPenn's Liniac cluster.  To run distribjob on the Liniac, set the nodeClass property in your controller.prop file to DistribJob::BprocNode.  Rather than calling distribjob directly, submit your distributed job to the Liniac's queue by calling liniacsubmit.  Here is its usage:

% liniacsubmit nodecount minutes controllerPropFileFullPath

liniacsubmit produces as immediate output the standard queue submission report.  If you want to run on 20 nodes and estimate your job will take 10 hours, use this:

% cd where_I_want_my_log/

% liniacsubmit 20 600 /my_inputdir/controller.prop >& task.log

When the queue runs your job, it will place the job's log in the directory where you ran liniacsubmit.  Check this log (in this case called task.log) to see your job progress.
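
The usage line for liniacsubmit names its second argument "minutes", so the 10-hour estimate in the example becomes 600.  A small sketch of that conversion follows; hours_to_minutes is a hypothetical helper for a Bourne-style shell, not a cluster command.

```shell
# Hypothetical helper: convert a whole number of hours to minutes.
hours_to_minutes() {
    echo $(( $1 * 60 ))
}
```

In a shell where the function is defined, the example submission could then be written as:

liniacsubmit 20 "$(hours_to_minutes 10)" /my_inputdir/controller.prop > task.log 2>&1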

To see the status of your job on the queue, run:

% showq

Running on a different type of cluster

If you plan on running on a cluster type other than UPenn's Liniac, you or your administrator will need to provide commands that parallel liniacsubmit.

Handling problems

These are the kinds of problems you may encounter:

  • False starts: annoying little errors that prevent your job from really running
  • Failures:  the subtask running on a node dies because of
    • reproducible problems
      • missing executables
      • missing files or directories
      • data errors in the input
    • non-reproducible problems
      • the node goes weird
      • the command goes weird
  • Hung jobs: the subtask running on a node gets hung
  • Wrong number of nodes: you realize that you want to add more (or take away some) nodes.  To do this, kill and restart with the desired number of nodes (see below).
  • Need to kill: you realize that you need to kill your job

False starts

These are errors in which the job immediately fails, and no work is dispatched to the nodes.  The most common reason is an error in a configuration file.  The log should explain the problem, although the explanation may be less than obvious.

Once you have corrected the problem, you need to delete your master directory, and start again.
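
Some false starts can be caught before the job is started at all.  The sketch below checks a few of the prerequisites this page describes: that GUS_HOME is set, that the input directory exists, and that the two configuration files are readable.  preflight_check is a hypothetical helper, and it assumes the conventional file names controller.prop and task.prop in the input directory; adjust it if you named or placed your files differently.

```shell
# Hypothetical pre-flight check (not part of DistribJob).
#   $1 = input directory, assumed to hold controller.prop and task.prop
# Prints one line per problem found and returns non-zero if there were any.
preflight_check() {
    inputdir=$1
    problems=0
    if [ -z "${GUS_HOME:-}" ]; then
        echo "GUS_HOME is not set"
        problems=$((problems + 1))
    fi
    if [ ! -d "$inputdir" ]; then
        echo "input directory not found: $inputdir"
        problems=$((problems + 1))
    fi
    for f in controller.prop task.prop; do
        if [ ! -r "$inputdir/$f" ]; then
            echo "missing or unreadable: $inputdir/$f"
            problems=$((problems + 1))
        fi
    done
    [ "$problems" -eq 0 ]
}
```

This only checks that the files are present.  It does not validate the properties inside them, so the log remains the place to look when a job still fails to start.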

Failures

When a subtask running on a node fails, the log will report the failure, and the job's files are copied to a directory in master/failures/subtask_nnn/result.  Look carefully in all the files to determine the cause of the failure.

If the problem appears to be one that will happen reproducibly, then you need to correct the source of the problem.  For example, you may need to correct your input file (but don't delete or add elements; this will mess up the indexing used by the controller).  Or, you may need to provide missing data or executable files.

If the problem seems like a random flux of the cosmos, then you can defer correcting the source of the problem.

After you have handled all the failures, and when your job is no longer running, delete the master/failures/ directory and restart the job (see below).
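
To review failures one by one, it helps to list them first.  The sketch below walks the master/failures/subtask_nnn layout described above and prints each failed subtask directory.  list_failed_subtasks is a hypothetical helper name, not a DistribJob command.

```shell
# Hypothetical helper: list the failed subtask directories under a master
# directory, using the master/failures/subtask_nnn layout described above.
#   $1 = path to the master directory
list_failed_subtasks() {
    masterdir=$1
    for dir in "$masterdir"/failures/subtask_*; do
        # If nothing matched, the pattern is left unexpanded; skip it.
        [ -d "$dir" ] || continue
        echo "$dir"
    done
}
```

The function prints nothing when there are no failures, so its output can be fed to a loop that inspects the files under each directory's result subdirectory.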

Hung subtasks

The directory master/running contains subdirectories for each running subtask.  Hung subtasks will have subdirectories there whose subtask number should have long since come and gone. 

If you detect a hung subtask, you will need to kill your job and restart (see below).  You can either wait till the rest of the subtasks are complete or, if you feel that the hung subtasks are using resources that you would rather have working for you, you can kill forthwith.
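
One rough way to spot candidates is to list the subdirectories of master/running that have not changed for a long time.  The sketch below is a heuristic under stated assumptions, not a DistribJob feature: stale_running_subtasks is a hypothetical helper, the age threshold is your own judgment, and a directory's modification time changes only when files are created or removed inside it, so an old timestamp suggests a hung subtask but does not prove one.

```shell
# Hypothetical helper: print subdirectories of MASTER/running that were last
# modified more than MINUTES minutes ago. Uses the -mindepth, -maxdepth and
# -mmin options of find, which GNU and BSD find provide but POSIX does not.
#   $1 = path to the master directory
#   $2 = age threshold in minutes
stale_running_subtasks() {
    masterdir=$1
    minutes=$2
    find "$masterdir/running" -mindepth 1 -maxdepth 1 -type d -mmin "+$minutes"
}
```

Compare the result with the progress reported in your log before deciding that a subtask is really hung.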

Need to kill

You may need to kill your job, either because you have hung subtasks, or because it turns out to be a virus bent on conquering the world.  There are two steps you need to take:

  1. Kill the distribjob controller, in one of two ways:
    a. To kill it as soon as possible (without corrupting its results):

       % distribjob controllerPropFileFullPath -kill

    b. To kill it without interrupting running subtasks:

       % distribjob controllerPropFileFullPath -killslow

  2. If you are running in a queue, kill the job in the queue.  (On UPenn's Liniac, use the canceljob command.)

Restarting

To restart a job, change the restart property in the controller.prop file to yes, and start the job the same way you did the first time.  Subtasks that are already complete will be skipped.  (This is controlled by the file master/completedSubtasks.log, which is a list of completed subtasks.  If you need to redo a subtask that has already completed, you can, at your own risk, delete its number from the list.  But remember that its results might already be merged into the main result.)
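
If you do decide to redo a completed subtask, editing the list by hand is error-prone.  The sketch below removes a single subtask number and keeps a backup of the list.  It assumes the list holds one subtask number per line, which this page does not confirm, so check the format of your own completedSubtasks.log first.  forget_completed_subtask is a hypothetical helper name.

```shell
# Hypothetical helper: remove one subtask number from the completed list so
# that the subtask is redone on restart. Assumes one number per line.
#   $1 = path to the master directory
#   $2 = subtask number to remove
forget_completed_subtask() {
    log=$1/completedSubtasks.log
    num=$2
    # Keep a backup so the original list can be restored.
    cp "$log" "$log.bak" || return 1
    status=0
    # -F -x matches the whole line literally, so 2 does not also remove 12.
    # grep exits 1 when no lines remain, which is fine; 2 or more is an error.
    grep -F -v -x -- "$num" "$log.bak" > "$log" || status=$?
    if [ "$status" -gt 1 ]; then
        cp "$log.bak" "$log"
        return 1
    fi
}
```

Run this only while the job is stopped, and remember that the removed subtask's earlier results may already be part of the main result.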

Coding your own task (advanced)

To code your own task, you will need to write a subclass of DistribJob::Task.  For samples, see:

  • $GUS_HOME/lib/perl/DJob/DistribJobTasks/SampleTask.pm
  • $GUS_HOME/lib/perl/DJob/DistribJobTasks/*Task.pm

You will also need to define the command that will run your subtasks on the nodes.  The command must:

  • Take as input a subset of the original input.  It may restrict itself to only one element of input, but would probably be more efficient if it can handle a set.  (This depends on the speed with which a single element is processed.  The subtask should run for at least a few seconds to mitigate the overhead of passing data back and forth between the server and the node.)
  • Accept full path names for all input files.
  • Write all temp files to the directory in which it runs.
  • Terminate with non-zero status on any error condition that might need to be corrected. 
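
The requirements above can be sketched as a small shell function.  This is an illustration under stated assumptions, not a script shipped with DistribJob: run_subtask is a hypothetical name, the upper-casing step is only a placeholder for real work, and writing the result to standard output is an assumption, since this page does not say where a command should leave its result.

```shell
# Hypothetical subtask command skeleton (not a DistribJob-provided script).
#   $1 = full path of a file holding a subset of the original input
run_subtask() {
    input=$1
    # Accept only full path names for input files.
    case "$input" in
        /*) ;;
        *) echo "input must be a full path: $input" >&2; return 2 ;;
    esac
    if [ ! -r "$input" ]; then
        echo "cannot read input: $input" >&2
        return 3
    fi
    # Write temp files to the directory in which the command runs.
    tmp=./run_subtask.$$.tmp
    # Placeholder for the real work: upper-case the bases in the input.
    if ! tr 'acgt' 'ACGT' < "$input" > "$tmp"; then
        rm -f "$tmp"
        return 4
    fi
    cat "$tmp"
    rm -f "$tmp"
}
```

Each error path returns a distinct non-zero status, so a failure that needs correcting is never reported as success.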

Using built-in nodes

DistribJob distributes the subtasks to nodes (i.e., machines in a cluster).  You specify how many compute slots each node has when you configure DistribJob (see Configuring above).  The DistribJob package includes four built-in types of node:

  • DistribJob::PbsNode is a node in a PBS cluster (the PGI genomics cluster uses PBS).
  • DistribJob::SgeNode is a node in a Sun Grid Engine cluster.
  • DistribJob::BprocNode is a node in a BPROC cluster.
  • DistribJob::LocalNode is a virtual node running on your local machine.

The first three Node types require a cluster computing environment. The last can be run on any multi-processor or even single-processor machine, where it is efficient to have more than one subtask running at a time.

Coding your own node (guru)

If your cluster uses a process control system other than PBS, Sun Grid Engine or BPROC, you can still use DistribJob, but you need to write some simple code.  DistribJob::Node is the object which represents a node.  Its main purpose is to specify how to communicate between the server and node.  The details of particular types of nodes are specified by subclasses of DistribJob::Node, which is what you will need to write.  To learn how, use DistribJob::BprocNode and DistribJob::LocalNode as samples.

You may also want to help your users by providing a cluster-specific startup script.  This script will submit a job to your cluster's queue, and then call distribjob (see Running above) when the job is ready to run.  As a sample, see /genomics/share/bin/liniacsubmit, which submits a job to UPenn's Liniac cluster.  Also see /genomics/share/bin/liniacjob, which is the script that runs and in turn calls distribjob.