SGE Instructions and Tips
Avoid interactive jobs on the head node
- Simple UNIX commands and text editors are OK.
- Jobs found running on the head node will be killed without notice.
- Jobs are managed on rous.mit.edu using Sun Grid Engine (SGE).
- Sun Grid Engine (SGE) is an advanced job scheduler for a cluster environment.
- The main purpose of a job scheduler is to utilize system resources in the most efficient way possible.
- SGE treats every node as a queue.
- The number of slots required for each job should be specified with the "-pe" flag.
- Each node provides 8 slots.
- Each user is allocated 32 slots by default.
- Jobs are submitted to SGE using a script.
- Many excellent and detailed SGE usage instructions can be found online, for example on the Princeton Genomics SGE page.
Creating an SGE script
- Jobs are generally submitted to SGE using a script. The job script allows all options and the programs/commands to be placed in a single file.
- It is possible to specify options on the command line, but this becomes cumbersome when there are many options.
- An example of a script that can be used to submit a job to the cluster is shown below. Open a new file, copy and paste the following commands, then save the file as myjob.sh or any other meaningful name. Note: Job names cannot start with a number.
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -V
#$ -m e
#$ -M email@example.com
#$ -pe whole_nodes 1
#############################################
# print date and time
date
# sleep for 60 seconds
sleep 60
# print date and time again
date
The first 7 lines specify important information about the submitted job; the rest of the file contains some simple UNIX commands (date, sleep) and comments (lines starting with #).
- The "#$" is used in the script to indicate an SGE option.
- #$ -S /bin/sh line specifies which shell to use for the job. If no shell is specified, the default user shell is used. Options include sh, bash and csh.
- -cwd specifies to run the job in the current working directory, including saving the .e and .o output files (see below) in the current directory.
- #$ -pe whole_nodes 1 specifies the number of slots (between 1 and 8 on rous) to request and reserve for the job.
- -V specifies to use the same environment variables as the submission shell.
- -m specifies when to send an email to the user (b for beginning, e for end, a for abort). The example uses e, so an email is sent when the job ends.
- -M specifies the email address to notify according to option -m. You should replace email@example.com with your email address.
Submitting a job
Submit your job by executing the command:
qsub myjob.sh
where myjob.sh is the name of the submit script. After submission, you should see the message:
- Your job XXX ("myjob.sh") has been submitted
where XXX is an auto-incremented job number assigned by the scheduler.
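When a later step needs the job number (for example, to monitor or delete the job), it can be pulled out of the confirmation message. The sketch below assumes the message has exactly the form shown above; the function name extract_job_id and the sample job number 12345 are made up for illustration.

```shell
#!/bin/sh
# Extract the job number from the confirmation message printed at submission.
# Assumes the message looks like: Your job XXX ("myjob.sh") has been submitted
extract_job_id() {
    # Keep only the digits that follow "Your job "; print nothing if there is no match.
    printf '%s\n' "$1" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Illustration with a made-up message; on the cluster you would instead use:
#   job_id=$(extract_job_id "$(qsub myjob.sh)")
message='Your job 12345 ("myjob.sh") has been submitted'
job_id=$(extract_job_id "$message")
echo "Submitted job number: $job_id"
```

Parsing with sed keeps the sketch portable to plain /bin/sh, which is the shell used by the example script.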
SGE job arrays
Often you need to run a large number of similar jobs. These jobs run the same program with different arguments, parameters, or input files. You could write a Perl/Python/shell script to generate all the job script files and qsub them one by one. However, this is not efficient. A much better way is an SGE array job. See Simple-Job-Array-Howto for more details and example scripts that use SGE job arrays.
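As a rough illustration of the idea, a single array job script can replace many near-identical scripts. The sketch below assumes the standard -t option and the SGE_TASK_ID variable that SGE sets for each task; the input_N.txt naming scheme is hypothetical, and the fallback to task 1 is there only so the script can be tried outside the scheduler.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -V
#$ -pe whole_nodes 1
#$ -t 1-10
# "-t 1-10" asks SGE to run this script as 10 tasks, numbered 1 to 10.

# SGE sets SGE_TASK_ID for each task of an array job.
# Fall back to 1 so the script can also be tested without SGE.
task_id="${SGE_TASK_ID:-1}"

# Hypothetical naming scheme: task N works on input_N.txt.
input_file="input_${task_id}.txt"

echo "Task ${task_id} would process ${input_file}"
```

Each task runs the same script, so everything that differs between tasks has to be derived from the task number.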
Monitoring a job
- To monitor the progress of your job, use the command:
qstat
- To display information relative only to the jobs you submitted, use the following:
qstat -u username
where username is your username.
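With many jobs submitted, a quick count per state can be easier to read than the full listing. The helper below is a sketch that assumes the default qstat layout (two header lines, then one row per job with the state in the fifth column); the function name count_jobs_in_state is made up for illustration.

```shell
#!/bin/sh
# Count jobs in a given state from qstat output read on standard input.
# Assumes two header lines, then one row per job with the state in
# column 5 (for example r = running, qw = queued and waiting).
count_jobs_in_state() {
    awk -v state="$1" 'NR > 2 && $5 == state { n++ } END { print n + 0 }'
}

# Usage on the cluster (replace username with your username):
#   qstat -u username | count_jobs_in_state r
#   qstat -u username | count_jobs_in_state qw
```

If the qstat output on your system uses a different column layout, the column number in the awk expression needs to be adjusted.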
Submitting jobs to specific queues
- To submit your job to a specific node (queue), use the following command:
qsub -q all.q@nX myjob.sh
where X is a number specifying the node you intend to use.
- To submit your job to a subset of queues (for example n3, n4, n5), use the following command:
qsub -q all.q@n3,all.q@n4,all.q@n5 myjob.sh
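For longer node lists, typing the queue names by hand is error-prone. The sketch below builds the comma-separated list from node numbers, assuming every queue follows the all.q@nX pattern used above; the function name queue_list is made up for illustration.

```shell
#!/bin/sh
# Build a comma-separated queue list such as all.q@n3,all.q@n4,all.q@n5
# from node numbers given as arguments.
queue_list() {
    list=""
    for node in "$@"; do
        # Add a comma only between entries, not before the first one.
        list="${list:+$list,}all.q@n${node}"
    done
    printf '%s\n' "$list"
}

# Print the resulting submit command instead of running it.
queues=$(queue_list 3 4 5)
echo "qsub -q $queues myjob.sh"
```

The command is only printed here; to submit, drop the echo and the quotes around the command.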
Viewing job results
- Any job run on the cluster is associated with two output files (one redirected from STDOUT and one redirected from STDERR).
- The file names consist of the submit script name followed by a suffix: the letter "o" (for STDOUT) or "e" (for STDERR) and the job number.
For example, after submitting myjob.sh, any output that would normally be printed to the screen is redirected to:
myjob.sh.oXXX
Similarly, any error output is directed to:
myjob.sh.eXXX
where XXX is the job number.
You can also redirect output within the submission script.
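Two ways of doing this are sketched below, under stated assumptions: the -o and -e options rename the scheduler's STDOUT and STDERR files, and ordinary shell redirection sends one command's output to a file of its own. All file names here (myjob.out, myjob.err, start_time.txt, start_time.err) are hypothetical.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -V
#$ -pe whole_nodes 1
# Optional: -o and -e choose the names of the job's STDOUT and STDERR files.
#$ -o myjob.out
#$ -e myjob.err

# Redirect one command's output and errors to files of their own.
date > start_time.txt 2> start_time.err

# Anything not redirected still goes to the job's STDOUT file.
echo "Start time saved to start_time.txt"
```

Because the script uses -cwd, all of these files are written to the directory from which the job was submitted.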
Deleting a job
- To stop and delete a job, use the following command:
qdel XXX
where XXX is the job number assigned by SGE when you submit the job using qsub.
- You can only delete your own jobs.
Checking the host status
- To check the status of the host and its nodes, you can use the following command:
qhost
- Several pieces of information are displayed, including the architecture of each node, the number of CPUs, the total memory, the memory in use, etc.
Interactive sessions
- qrsh (rsh)
- qlogin (ssh), for when you need an X11 window
- Use qrsh when you need to compile, test, or run a program interactively.
- Remember to exit cleanly from interactive sessions when done; otherwise they will be killed without notice.