The Genomics Institute and Bioinformatics Core support a large parallel compute cluster (currently 256 processors) for large, compute-intensive jobs. For more information or to report a problem, send mail to manager@genomics.upenn.edu.
A number of job managers have been written that allow users to specify scripts to run on the nodes and manage jobs (setting up the nodes and sending and retrieving data to/from the nodes). Currently managers have been written for running BLAST, RepeatMasker, ArrayOligoSelector (for designing oligos for array hybridizations) and GeneHunter-TwoLocus (GHT). Email bioinfocore@pcbi.upenn.edu if you need help getting started running these applications.
DistribJob: A generic job manager that lets users write modules and scripts that the job manager uses to interact with the nodes. Modules currently available support BLAST, RepeatMasker and a variety of specific applications. This is the recommended job manager because it is extensible, better documented, and fairly robust in error recovery. Typing "/genomics/share/bin/distribjob" at the liniac prompt prints brief help; "/genomics/share/bin/distribjob -help" prints more complete help.
Additional DistribJob help follows:
You must set your GUS_HOME environment variable and add $GUS_HOME/bin to your PATH. If you are using csh or tcsh, add the following to your .login:
setenv GUS_HOME /genomics/share/pkg/gus/gushome
setenv PATH $GUS_HOME/bin:$PATH
If you are using bash, add the following to your .bash_profile:
GUS_HOME=/genomics/share/pkg/gus/gushome; export GUS_HOME
PATH=$GUS_HOME/bin:$PATH; export PATH
Then log out and log in again.
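After logging back in, it can be worth confirming that both settings took effect. The following is a minimal sketch, not part of DistribJob; check_gus_env is a hypothetical helper name, and the check only looks at the two variables set above.

```shell
#!/bin/sh
# Sketch: confirm the DistribJob environment is set up after logging back in.
# check_gus_env is a hypothetical helper name, not something DistribJob provides.
check_gus_env() {
    if [ -z "${GUS_HOME:-}" ]; then
        echo "GUS_HOME is not set"
        return 1
    fi
    # Wrap PATH in colons so the entry matches at the start, middle, or end.
    case ":$PATH:" in
        *":$GUS_HOME/bin:"*)
            echo "ok: $GUS_HOME/bin is on PATH"
            ;;
        *)
            echo "GUS_HOME is set, but $GUS_HOME/bin is not on PATH"
            return 1
            ;;
    esac
}

# Report the state of the current login session without aborting the shell.
check_gus_env || true
```

If the check reports a problem, re-read the .login or .bash_profile lines above for typos before trying distribjob.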
You provide a task to DistribJob (see configuration below). The task determines:
Choose from the built-in tasks provided by the DistribJobTasks module, or code your own task.
Following are constraints that apply to your task's input:
Examples showing how to run DistribJob for the three existing genomics tasks (BlastMatrix, BlastSimilarity and RepeatMasker) can be found in $GUS_HOME//test/DJob/DistribJobTasks.
The first step is to decide where you want your input directory, and where you want your master directory. Typically, these will go in a directory dedicated to this run of your task. Just make sure there is enough storage space to accommodate your results.
As an example, let's say you choose to run in $HOME/myrun, and that your task is creating a BLAST matrix (one of the provided bioinformatics tasks). Create your input directory, and copy the input file to it (in this case, a set of DNA sequences).
% mkdir -p $HOME/myrun/input
% cp myseqs.fsa $HOME/myrun/input
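The setup above, together with the earlier advice about storage space, can be sketched as one small helper. This is an illustration only: prepare_run_dir is a hypothetical name, the demonstration runs in a scratch directory with a tiny stand-in sequence file, and the free-space figure is simply what df reports for the file system holding the run directory.

```shell
#!/bin/sh
# Sketch: create the input directory, copy the input file, and report free space.
# prepare_run_dir is a hypothetical helper; the directory layout follows the text.
prepare_run_dir() {
    run_dir=$1
    input_file=$2
    mkdir -p "$run_dir/input" || return 1
    cp "$input_file" "$run_dir/input/" || return 1
    # Column 4 of "df -Pk" is the available space in kilobytes.
    df -Pk "$run_dir" | awk 'NR == 2 { print "free KB: " $4 }'
}

# Demonstration in a scratch directory, using a tiny stand-in sequence file.
scratch=$(mktemp -d)
printf '>seq1\nACGT\n' > "$scratch/myseqs.fsa"
prepare_run_dir "$scratch/myrun" "$scratch/myseqs.fsa"
```

For a real run, the call would be something like prepare_run_dir $HOME/myrun myseqs.fsa. How much free space counts as "enough" depends on the task and is not something this sketch can judge.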
Create two configuration files for your task (the standard place for them is your input directory). Example files referred to in this section are in /genomics/share/controllers/DistribJobTasks/*Input/
You may need to make other resources available to your task. For example, the provided BlastMatrixTask and BlastSimilarityTask make use of a database of sequences to BLAST against. This file may be very large, so you will not want to copy it into your input directory. Instead, specify the resource location using the appropriate property in your task.prop file.
The distribjob command starts the controller. Running it with no arguments prints its usage. Running it with -help prints its full help display.
You use different commands to run locally or to run on different clusters.
Regardless of how you run, the results of all the subtasks will be merged into master/mainresult.
If you use DistribJob::LocalNode as your node type, you are running locally (i.e., on your local server). You do not need to submit your job to a cluster queue; you just run distribjob directly. To do so, use this command (if, for example, you want to distribute across 3 virtual nodes):
% distribjob your_controller.prop 1 2 3
You are assumed to know how to use UPenn's Liniac cluster. To run distribjob on the Liniac, set the nodeClass property in your controller.prop file to DistribJob::BprocNode. Rather than calling distribjob directly, submit your distributed job to the Liniac's queue by calling liniacsubmit. Here is its usage:
% liniacsubmit nodecount minutes controllerPropFileFullPath
liniacsubmit produces as immediate output the standard queue submission report. If you want to run on 20 nodes and estimate your job will take 10 hours, use this:
% cd where_I_want_my_log/
% liniacsubmit 20 600 /my_inputdir/controller.prop >& task.log
When the queue runs your job, it will place the job's log in the directory where you ran liniacsubmit. Check this log (in this case called task.log) to see your job's progress.
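Because liniacsubmit takes its time estimate in minutes, a 10-hour estimate becomes 600. The arithmetic can be sketched as follows; hours_to_minutes is a hypothetical helper that accepts whole hours only, and the command is printed rather than run so it can be reviewed first.

```shell
#!/bin/sh
# Sketch: convert an estimate in whole hours to liniacsubmit's "minutes" argument.
# hours_to_minutes is a hypothetical helper, not part of liniacsubmit.
hours_to_minutes() {
    echo $(( $1 * 60 ))
}

nodes=20
minutes=$(hours_to_minutes 10)

# Print the command instead of running it, so it can be checked before submission.
echo "liniacsubmit $nodes $minutes /my_inputdir/controller.prop"
# prints: liniacsubmit 20 600 /my_inputdir/controller.prop
```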
To see the status of your job on the queue, run:
% showq
If you plan on running on a cluster type other than UPenn's Liniac, you or your administrator will need to provide commands that parallel liniacsubmit.
These are the kinds of problems you may encounter:
These are errors in which the job fails immediately, and no work is dispatched to the nodes. The most common cause is an error in a configuration file. The log should explain the problem, although the explanation may not be obvious.
Once you have corrected the problem, you need to delete your master directory, and start again.
When a subtask running on a node fails, the log will report the failure, and the job's files are copied to a directory in master/failures/subtask_nnn/result. Look carefully through all the files to determine the cause of the failure.
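To see at a glance which subtasks have failed, the failures directory can be listed. The following is a minimal sketch that assumes only the master/failures/subtask_nnn layout described above; list_failed_subtasks is a hypothetical helper, and the demonstration builds a scratch master directory rather than touching a real one.

```shell
#!/bin/sh
# Sketch: list the subtask directories found under master/failures.
# list_failed_subtasks is a hypothetical helper, not part of DistribJob.
list_failed_subtasks() {
    master_dir=$1
    # No failures directory means there is nothing to report.
    [ -d "$master_dir/failures" ] || return 0
    for d in "$master_dir"/failures/subtask_*; do
        # Skip the unexpanded pattern when no subtask directories exist.
        [ -d "$d" ] && basename "$d"
    done
    return 0
}

# Demonstration with a scratch master directory holding two failed subtasks.
scratch=$(mktemp -d)
mkdir -p "$scratch/master/failures/subtask_003/result"
mkdir -p "$scratch/master/failures/subtask_007/result"
list_failed_subtasks "$scratch/master"
```

Each name printed corresponds to a directory whose result files should be inspected by hand; the sketch does not try to interpret them.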
If the problem appears to be one that will happen reproducibly, then you need to correct the source of the problem. For example, you may need to correct your input file (but don't delete or add elements; this will mess up the indexing used by the controller). Or, you may need to provide missing data or executable files.
If the problem seems like a random flux of the cosmos, then you can defer correcting the source of the problem.
After you have handled all the failures, and when your job is no longer running, delete the master/failures/ directory and restart the job (see below).
The directory master/running contains subdirectories for each running subtask. Hung subtasks will have subdirectories there whose subtask number should have long since come and gone.
If you detect a hung subtask, you will need to kill your job and restart (see below). You can either wait until the rest of the subtasks are complete or, if you feel that the hung subtasks are using resources that you would rather have working for you, you can kill the job forthwith.
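One rough way to spot candidates for hung subtasks is to look for old subdirectories in master/running. The sketch below is a heuristic under stated assumptions, not DistribJob's own check: it treats a subdirectory's modification time as a proxy for activity, stale_running_subtasks is a hypothetical helper, and the -mindepth, -maxdepth and -mmin options of find are extensions available in GNU and BSD find rather than POSIX.

```shell
#!/bin/sh
# Sketch: list subdirectories of master/running not modified in the last N minutes.
# stale_running_subtasks is a hypothetical helper; directory age is only a heuristic.
stale_running_subtasks() {
    master_dir=$1
    minutes=$2
    [ -d "$master_dir/running" ] || return 0
    find "$master_dir/running" -mindepth 1 -maxdepth 1 -type d -mmin +"$minutes"
}

# Demonstration: one directory with an old timestamp, one freshly created.
scratch=$(mktemp -d)
mkdir -p "$scratch/master/running/old_subtask" "$scratch/master/running/new_subtask"
touch -t 200001010000 "$scratch/master/running/old_subtask"
stale_running_subtasks "$scratch/master" 30
```

A directory listed here is only a candidate. A long-running subtask may legitimately leave its directory untouched, so compare its subtask number against the progress shown in the log before killing the job.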
You may need to kill your job, either because you have hung subtasks, or because it turns out to be a virus bent on conquering the world. There are two steps you need to take:
i. % distribjob controllerPropFileFullPath -kill
ii. % distribjob controllerPropFileFullPath -killslow
To restart a job, change the restart property in the controller.prop file to yes, and start the job the same way you did the first time. Subtasks that are already complete will be skipped. (This is controlled by the file master/completedSubtasks.log, which is a list of completed subtasks. If you need to redo a subtask that has already completed, you can, at your own risk, delete its number from the list. But remember that its results might already be merged into the main result.)
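Removing a number from the completed list by hand is easy to get wrong, for example by also removing 14 when you meant 4. The sketch below shows one careful way to do it. It assumes, without confirmation from the DistribJob documentation, that completedSubtasks.log holds one subtask number per line; forget_completed_subtask is a hypothetical helper, and it does nothing about results already merged into the main result.

```shell
#!/bin/sh
# Sketch: remove one subtask number from a completed-subtasks list.
# ASSUMPTION: the log holds one subtask number per line.
# forget_completed_subtask is a hypothetical helper, not part of DistribJob.
forget_completed_subtask() {
    log_file=$1
    subtask=$2
    # Keep a backup so the original list can be restored.
    cp "$log_file" "$log_file.bak" || return 1
    # -x matches whole lines only and -F treats the number as a literal string.
    # grep exits with status 1 when it outputs nothing, which is fine here.
    grep -v -x -F -- "$subtask" "$log_file.bak" > "$log_file" || [ $? -eq 1 ]
}

# Demonstration on a scratch copy of a log listing subtasks 1, 4, 14 and 40.
scratch=$(mktemp -d)
mkdir -p "$scratch/master"
printf '1\n4\n14\n40\n' > "$scratch/master/completedSubtasks.log"
forget_completed_subtask "$scratch/master/completedSubtasks.log" 4
cat "$scratch/master/completedSubtasks.log"
```

Run something like this only while the job is not running, and check the actual format of your completedSubtasks.log first.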
To code your own task, you will need to write a subclass of DistribJob::Task. For samples, see:
You will also need to define the command that will run your subtasks on the nodes. The command must:
DistribJob distributes the subtasks to nodes (i.e., machines in a cluster). You will specify how many compute slots each node has when you configure DistribJob (discussed below). The DistribJob package includes three built-in types of node:
The first three Node types require a cluster computing environment. The last can be run on any multi-processor or even single-processor machine, where it is efficient to have more than one subtask running at a time.
If your cluster uses a process control system other than BPROC, you can still use DistribJob, but you need to write some simple code. DistribJob::Node is the object which represents a node. Its main purpose is to specify how to communicate between the server and node. The details of particular types of nodes are specified by subclasses of DistribJob::Node, which is what you will need to write. To learn how, use DistribJob::BprocNode and DistribJob::LocalNode as samples.
You may also want to help your users by providing a cluster-specific startup script. This script will submit a job to your cluster's queue, and then call distribjob (see Running below) when the job is ready to run. As a sample, see /genomics/share/bin/liniacsubmit, which submits a job to UPenn's Liniac cluster. Also see /genomics/share/bin/liniacjob, which is the script that runs; it in turn calls distribjob.