Job Jar User Guide

Overview

The Job Jar is a simple batch queuing system for Unix. Its main distinguishing feature is that there is no central daemon. Instead, an arbitrary set of workers cooperatively claim jobs from a central directory. A job is any Unix executable file (usually a script). Jobs are run in a fresh directory, with a controlled environment that includes information such as the path to that directory.

System Requirements

Optional System Components

Installation

Deciding Which User Will Run The Job Jar

Any user may run a Job Jar system. No special privileges are required, aside from read and write permission in the Job Jar installation directory (or whichever directories you have configured the system to run in; see Configuration). You may wish to create a user specifically to run the Job Jar system, for auditing purposes. If so, it is easiest to create and log in as that user before proceeding.

Unpacking The Distribution

A standard Job Jar installation is simply a directory hierarchy, located on a disk that can be accessed by all potential workers.

Assuming the Job Jar distribution is in the compressed tar file jobjar-1_0_0.tgz, to unpack the Job Jar distribution under the directory /opt, type:

cd /opt
gunzip -c jobjar-1_0_0.tgz | tar xvf -

This will create and populate the directory /opt/jobjar-1_0_0.

Contents Of The Distribution

Unpacking the distribution will create the following subdirectories:

  • bin: contains programs you will run directly
  • crontabs: contains crontabs for use with the cron daemon
  • documents: contains documentation for the Job Jar system
  • workers: the default location for workers to create their own directories as they run, containing logs, job-specific scratch directories, and other worker-specific information
  • unclaimed: the default location for unclaimed jobs
  • claimed: the default location for claimed jobs
  • completed: the default location for completed jobs
  • limbo: the default temporary location for jobs that are about to be submitted
  • jobjar: contains a Python package for interfacing with the Job Jar system from Python
  • stranded: the default location for jobs recovered by the recover_jobs program

In the rest of this manual, it will be assumed that the Job Jar is installed under the directory named by the environment variable JJ_HOME .

Configuring Your Installation

The Job Jar system needs no special configuration to run. However, you can change where workers look for jobs, and control some aspects of the system's behavior, through environment variables and configuration files. See the section Configuration for details.

Installing The Python Client Library

The Job Jar distribution includes a Python library for interfacing with the Job Jar system from Python. You can use this library from the initial installation directory, or install it in a directory in your PYTHONPATH. If you use the Python client library from the Job Jar installation directory, it will automatically infer the location of the job directories. If you install the Python client library elsewhere, you must tell it where the job and worker directories are. To do this, either:

  • Set the configuration variable JJ_HOME, and possibly JJ_WORKER_HOME, or
  • Call the function set_home, and possibly set_worker_home, in the Python client library

See the section Configuration for more details.

The Cron Jobs

The crontabs directory in a Job Jar installation contains example crontab files which you can use to:

  • Automatically restart workers if they quit unexpectedly, for example if there is a power failure or a worker node reboots;
  • Automatically recover jobs stranded by crashed workers

A cron job is only effective on the nodes on which it is installed. It cannot restart workers on other nodes, or find stranded jobs that were claimed by workers on other nodes. Thus, the cron job must be run on each node on which you wish to restart workers, or recover stranded jobs, automatically.

You must edit an example crontab file before submitting it to cron, to fill in the path to the Job Jar installation. Additionally, cron jobs must be run in an environment that contains the Python interpreter in the default search path.
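
The following is a hypothetical crontab (using Vixie cron syntax) illustrating the idea behind the worker.cron example: it restarts missing workers every ten minutes. The schedule and the installation path are illustrative; edit them to match your site.

# Put the Python interpreter on the default search path for cron jobs
PATH=/bin:/usr/bin:/usr/local/bin
# Restart missing workers every ten minutes
0,10,20,30,40,50 * * * * /opt/jobjar-1_0_0/bin/check_workers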

Concepts and Terms

Node: A single computer on which one or more Job Jar workers run

Worker: A long-lived process running on a single node, which claims and runs jobs from the unclaimed jar

Job: Any Unix executable which has been submitted to the Job Jar system

Seed Job: A job which submits other jobs to the Job Jar system when run

Priority: An integer associated with a job, which determines in which order the job will be claimed relative to other jobs (higher-priority jobs are claimed before lower-priority jobs)

Category: A string associated with a job, which determines which workers will attempt to claim the job

Limbo: A directory containing jobs that have not yet been submitted to the unclaimed jar

Unclaimed Jar: A directory containing jobs that have not yet been claimed by any worker

Claimed Jar: A directory containing symbolic links to jobs that have been claimed by some worker, but not yet completed

Completed Jar: A directory containing jobs that have been run to completion by a worker

Pause: Temporarily suspend a worker without causing it to exit

Stop: Cause a worker process to exit

Stranded Job: A job that was claimed by a worker that subsequently crashed (for example, because of a system reboot)

See the section The Worker Lifecycle for more information about workers.

Starting the Job Jar System

To start the Job Jar system, all you need to do is start at least one worker.

Using The start_worker Program

To start a worker on a node, run the program bin/start_worker . A background worker process will be started and will begin looking in the unclaimed jar for jobs. Type:

cd $JJ_HOME/bin
./start_worker

You may supply the start_worker program with a list of categories that this worker is allowed to process. Type:

cd $JJ_HOME/bin
./start_worker LINUX AMD_OPTERON

This command creates a worker that will claim jobs which are categorized as LINUX and AMD_OPTERON jobs, in addition to claiming uncategorized jobs . See the section Job Categories for more information about using categories.

If no categories are specified, the worker will only claim uncategorized jobs and jobs in the implicit category.

Any number of workers may be started on a given node. There is no explicit load balancing in the Job Jar system, so choose the number of workers you want to run based on:

  • The computational load you expect from any given job
  • How much of the node you wish to dedicate to batch processing

Using The check_workers Program

The check_workers program inspects the worker directories in a Job Jar installation. It checks to see how many workers are running on the current node (the node on which the check_workers program is run). If fewer than the expected number of workers are running, check_workers starts as many workers as necessary to bring the number of running workers up to the expected number. The expected number of workers for a given node is determined by the JJ_MAX_WORKERS configuration parameter (see Configuration).

The check_workers program is what the worker.cron example crontab file invokes. You can also run check_workers manually, which can be useful if you wish to manage a Job Jar system without the use of crontabs. Type:

cd $JJ_HOME/bin
./check_workers

Stopping The Job Jar System

Workers in a Job Jar system may only be stopped between jobs. You cannot interrupt a running worker without manually killing the job it is running.

When a worker is stopped, it first finishes the job it is currently running. Then it moves the job to the completed jar, removes the job-specific scratch directory it created when it started the job, and exits.

Using The stop_worker Program

The program $JJ_HOME/bin/stop_worker stops some or all of the workers in a Job Jar system.

To stop all currently-running workers, type:

cd $JJ_HOME/bin
./stop_worker all

New workers created after you run stop_worker all will not be stopped (e.g. if you are running the Job Jar system with cron jobs to restart workers). The command stop_worker all actually creates STOP files in each individual worker's directory, rather than a STOP_ALL file in the JJ_WORKER_HOME directory. See the section Stopping The System Manually for information about STOP and STOP_ALL files.

To stop all currently-running workers on a particular node, type:

cd $JJ_HOME/bin
./stop_worker NODE1 NODE2 ...

Replace NODE1, etc., with actual node names.

To stop particular workers, type:

cd $JJ_HOME/bin
./stop_worker NODE1.PID1 NODE2.PID2 ...

Replace NODE1.PID1, etc., with the name of the worker directory for the worker you want to stop.

You can list both nodes and individual workers when invoking stop_worker. For example:

cd $JJ_HOME/bin
./stop_worker rain flower.13764 meadow

The previous command will stop all workers on nodes rain and meadow, as well as the single worker on node flower with process ID 13764.

Stopping The System From The Python Client Library

The Python client library contains the following functions to stop the Job Jar system:

  • stop_all_workers(): Stop all currently-running workers. This is equivalent to running the stop_worker program with the argument all .
  • stop_worker(spec, spec, ...): Stop all currently-running workers that are described by the arguments. Each argument is a string of the form <node> or <node>.<pid> . This is equivalent to running the stop_worker program with the same arguments.

Stopping The System Manually

To stop an individual worker in a Job Jar system, create a file named STOP in the worker's directory. For example, assuming a worker is running on node "flower" with process ID 5123, type:

cd $JJ_HOME/workers/flower.5123
touch STOP

When the worker finishes the job it is currently running, it will move the job to the completed jar, remove the job-specific scratch directory it created when it started the job, and exit.

If you wish to stop all workers in a running Job Jar system, you do not need to stop each worker individually. You can stop the entire system by creating a file named STOP_ALL in the JJ_WORKER_HOME directory. Type:

cd $JJ_WORKER_HOME
touch STOP_ALL

As each worker checks for jobs, it will notice this file and exit. This is equivalent to creating a STOP file in each individual worker directory, but more convenient.

Note: A STOP_ALL file will cause new workers to stop before claiming any jobs. If you are running a Job Jar system with cron jobs to automatically restart workers, each new worker the cron job starts will stop immediately without doing any work. Use STOP_ALL with care.

Pausing And Unpausing The Job Jar System

Workers may be paused between jobs. When a worker is paused, it first finishes the job it is currently running. Then it moves the job to the completed jar, removes the job-specific scratch directory it created when it started the job, and begins sleeping. It does not claim jobs from the unclaimed jar, but it does not exit completely. A paused worker periodically wakes up to check whether it should remain paused, resume work, or stop.

Stopping has priority over pausing. That is, if a STOP or STOP_ALL file is present, the worker will exit rather than pausing.

Using The pause_worker Program

The program $JJ_HOME/bin/pause_worker pauses some or all of the workers in a Job Jar system.

To pause all currently-running workers, type:

cd $JJ_HOME/bin
./pause_worker all

New workers created after you run pause_worker all will not be paused. The command pause_worker all actually creates PAUSE files in each individual worker's directory, rather than a PAUSE_ALL file in the JJ_WORKER_HOME directory. See the section Pausing And Unpausing The System Manually for information on PAUSE and PAUSE_ALL files.

To pause all currently-running workers on a particular node, type:

cd $JJ_HOME/bin
./pause_worker NODE1 NODE2 ...

Replace NODE1, etc., with actual node names.

To pause individual workers, type:

cd $JJ_HOME/bin
./pause_worker NODE1.PID1 NODE2.PID2 ...

Replace NODE1.PID1, etc., with the name of the worker directory for the worker you want to pause.

You can list both nodes and individual workers when invoking pause_worker. For example:

cd $JJ_HOME/bin
./pause_worker rain flower.996 meadow

The previous command will pause all workers on nodes rain and meadow, as well as the single worker on node flower with process ID 996.

Using The unpause_worker Program

The program $JJ_HOME/bin/unpause_worker unpauses some or all of the workers in a Job Jar system. When a worker is unpaused, it will begin claiming jobs from the unclaimed jar the next time it wakes up, unless it has been stopped.

The unpause_worker program takes the same arguments as the pause_worker program. It is harmless to unpause a worker that is not paused, so you can use unpause_worker all after pausing just one or two workers, rather than unpausing the individual workers.
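
For example, to unpause every worker in the system after a system-wide pause, type:

cd $JJ_HOME/bin
./unpause_worker all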

Pausing And Unpausing The System From The Python Client Library

The Python client library contains the following functions to pause and unpause the Job Jar system:

  • pause_all_workers(): Pause all currently-running workers in the Job Jar system. This is equivalent to running the pause_worker program with the argument all .
  • pause_worker(spec, spec, ...): Pause all currently-running workers that are described by the arguments. Each argument is a string of the form <node> or <node>.<pid> . This is equivalent to running the pause_worker program with the same arguments.
  • unpause_all_workers(): Unpause all currently-paused workers in the Job Jar system. This is equivalent to running the unpause_worker program with the argument all.
  • unpause_worker(spec, spec, ...): Unpause all currently-paused workers that are described by the arguments. Each argument is a string of the form <node> or <node>.<pid> . This is equivalent to running the unpause_worker program with the same arguments.

Pausing And Unpausing The System Manually

To pause an individual worker, create a file named PAUSE in the worker's directory. For example, assuming a worker is running on node "rain" with process ID 17192, type:

cd $JJ_HOME/workers/rain.17192
touch PAUSE

You can pause all workers in a Job Jar system by creating a file named PAUSE_ALL in the JJ_WORKER_HOME directory. Type:

cd $JJ_WORKER_HOME
touch PAUSE_ALL

This is equivalent to creating a PAUSE file in each individual worker directory, but more convenient. It can also be used to start new workers in a paused state.

To resume normal operations, remove the PAUSE or PAUSE_ALL files.
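
For example, to end a system-wide pause, type:

cd $JJ_WORKER_HOME
rm PAUSE_ALL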

Writing Jobs

A job may be any Unix executable program. In practice, it is most effective to write Unix shell scripts that launch other programs with whatever parameters they require, since jobs are run without any command-line parameters. The program bin/submit_job in a Job Jar installation can automate creation of a shell script for simple tasks. See the section Submitting Jobs for more information.

The Job Environment

When a job is run by a Job Jar worker, its environment will contain some useful variables.

  • JJ_HOME. The absolute path to the parent directory of the job directories (unclaimed, claimed, and completed).
  • JJ_SCRATCH_DIR. The absolute path to a fresh directory which the job may use in any way it wishes. This is also the current working directory when the job is started. This directory, and all its contents, will be deleted after the job exits.
  • JJ_WORKER_PID. The process ID of the worker that started this job.
  • JJ_WORKER_DIR. The directory in which the worker that started this job keeps its logs and other administrative information.
  • PATH. The standard Unix search path. Its value will be "/bin:/usr/bin:/usr/local/bin" .

Remember that the job's scratch directory ($JJ_SCRATCH_DIR) is deleted after the job finishes. If your job produces output files that need to be saved, it must move them to their destination before exiting.
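
The following sketch of a job script illustrates this pattern; the program and output paths are hypothetical. The job runs in its scratch directory, then moves its result to permanent storage before exiting:

#!/bin/sh -f
# This job starts in $JJ_SCRATCH_DIR, which is deleted when the job
# exits, so the result file is moved to permanent storage first.
/usr/local/bin/big_computation /data/infile.tmp 104.1 30 > result.out
mv result.out /data/results/infile.out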

Writing Jobs Through a Front-End Program

A common use for the Job Jar system is parallelizing large computations across multiple nodes. One way to do this is to use a front-end program which takes the parameters of the computation as input and creates multiple jobs to carry out the computation.

For example, suppose the program run_sr performs a portion of a parallelizable computation called SR, which is parameterized by filename, start index, and end index. A hypothetical front-end system might prompt the user for the number of workers and the input parameters for the computation, then create multiple output jobs and submit them to the Job Jar system. An input of filename /data/sr_input, start index 1, end index 100, and 10 total workers, might produce 10 jobs like:

run_sr /data/sr_input 1 10
run_sr /data/sr_input 11 20
...
run_sr /data/sr_input 91 100

The front-end program could save these jobs to files for the user to submit, or submit them directly to the Job Jar system. See the section Submitting Jobs for different ways to submit jobs to the system.
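
The following is a minimal sketch of such a front end as a shell script. It assumes the range divides evenly among the jobs and that run_sr is on the default job search path; a real front end would validate its arguments and handle remainders:

#!/bin/sh
# make_sr_jobs (hypothetical): split the range [start, end] into
# equal chunks and submit one run_sr job per chunk.
# Usage: make_sr_jobs input start end jobs
input=$1; start=$2; end=$3; jobs=$4
chunk=`expr \( $end - $start + 1 \) / $jobs`
i=$start
while [ $i -le $end ]; do
  last=`expr $i + $chunk - 1`
  $JJ_HOME/bin/submit_job run_sr $input $i $last
  i=`expr $last + 1`
done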

Jobs That Submit Other Jobs

An alternative to writing a front-end for creating jobs is to write a job which itself spawns other jobs, then exits. A job that submits other jobs is known as a seed job.

To continue the SR example from the last section, you could write a non-interactive program named distribute_sr (the seed job) which takes parameters for the overall computation, then submits jobs for the subtasks which make it up. Then a user could submit a job like:

distribute_sr -input=/data/sr_input -start=1 -end=100 -workers=10

The distribute_sr program would then submit 10 jobs like the following, and exit:

run_sr /data/sr_input 1 10
run_sr /data/sr_input 11 20
...
run_sr /data/sr_input 91 100

This approach is quite similar to the approach of writing a front-end program for job creation and submission. However, seed jobs can express some ideas more concisely.

A front-end program can also be used to submit a seed job. The most flexible design is to write a simple seed job whose command-line parameters define the sub-jobs the seed job will create, along with a front-end program that generates seed jobs. This results in a modular, domain-specific interface, which can be extended or invoked in different ways.

Multi-Stage Jobs

Complex computations often involve multiple stages. Later stages often cannot be started until earlier stages are complete. This section demonstrates one way to synchronize multiple-stage jobs.

Suppose that the SR computation is easily parallelized, but that the intermediate outputs of the various run_sr jobs must be merged back into a single output file for use as input to a later computation. That is, the output of the various run_sr programs must be used as input to a program called merge_sr; however, merge_sr cannot run until all the run_sr instances have completed. One way to do this is to write a simple seed job which submits all the run_sr jobs, along with another job, monitor_sr, that:

  1. Checks to see if all the run_sr jobs have finished
  2. Submits merge_sr if they have, or
  3. Resubmits itself to the Job Jar system if they have not

A worker may claim monitor_sr at any time. If the run_sr jobs have not all completed, the worker will simply resubmit monitor_sr and look for other work.
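
The following is a sketch of what monitor_sr might look like. It assumes each run_sr job writes a file named /data/sr_output.<start> when it finishes, and that the worker invokes jobs by absolute path (so $0 names the job file); both conventions are illustrative:

#!/bin/sh
# monitor_sr (hypothetical): submit merge_sr once all 10 run_sr jobs
# have written their output files; otherwise re-queue this monitor.
finished=`ls /data/sr_output.* 2>/dev/null | wc -l`
if [ $finished -ge 10 ]; then
  $JJ_HOME/bin/submit_job merge_sr /data/sr_output
else
  $JJ_HOME/bin/submit_job -f $0
fi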

This same idea can be applied to computations with an arbitrary number of stages. Each stage will have a seed job that submits the job or jobs that make up the current stage, along with a monitor job, like monitor_sr, that checks whether the current stage has finished. When the current stage finishes, the monitor will submit the seed job for the next stage, which will behave in the same way. In a three-stage computation:

  1. Stage 1 seed job is submitted
  2. Stage 1 seed job submits computation job and monitor job
  3. Stage 1 monitor job eventually submits Stage 2 seed job
  4. Stage 2 seed job submits computation job and monitor job
  5. Stage 2 monitor job eventually submits Stage 3 seed job

The seed jobs allow you to encapsulate each stage of a computation as a single idea, simplifying maintenance and allowing you to run later stages without running earlier stages, if necessary.

The advantages of this approach are:

  • The initial job is expressed simply
  • There is no need for a separate daemon to monitor the output of the computation job; instead, any idle worker may check whether the first stage of the computation has finished
  • Idle systems will check for completion frequently, while loaded systems will work on other jobs first (when the monitor job resubmits itself, the new instance will go at the end of the job queue, after any other jobs)

The disadvantages are:

  • There may be a higher latency for a particular computation than could be achieved with other methods (such as a separate daemon process to monitor intermediate files)
  • Computations that involve many stages do not have a single job file that expresses the whole computation

Submitting Jobs

Conceptually, submitting a job means copying an executable file to the unclaimed jar. In practice, the system first copies the executable file to an intermediate directory, then moves it atomically to the unclaimed jar. This prevents a worker from claiming a partially-copied job. The intermediate directory is $JJ_HOME/limbo .

Job Priorities

Each job has an associated priority. Higher-priority jobs are claimed by workers before lower-priority jobs. A job's priority is determined when it is submitted.

You do not need to specify a priority for a job. However, if there is already a long queue of jobs in the system, and you need a new job to be run before the older jobs complete, you can submit a job with a higher priority than what is already in the queue.

Alternatively, if you are going to submit a large number of jobs for some task that is not particularly urgent, you can submit those jobs with a low priority to let other jobs run to completion first. Note, however, that in a busy system with new jobs constantly being submitted, it is possible that low-priority jobs may never be claimed.

The default priority is 0. Negative priorities are allowed. If two jobs have equal priority, the older of the two jobs is claimed first.

Job Categories

Each job may be assigned to a particular category when it is submitted. Jobs that are not assigned to any category are uncategorized jobs. You can use categories to ensure that certain jobs are only run on nodes with a particular operating system, performance profile, or other distinguishing feature.

Each Job Jar worker has an ordered list of categories from which it will claim jobs. All workers will claim uncategorized jobs if there are no jobs in the other categories they are checking, so you do not need to use categories if you do not wish to. Each Job Jar worker also implicitly checks for jobs in a category with the same name as the node on which the worker is running.

Because a worker's list of categories is checked in order, job priorities only have meaning within a single category. If a worker is checking categories FAST and LINUX in that order, then it will not claim any jobs in the LINUX category until there are no jobs in the FAST category, no matter what the priorities of jobs in the LINUX category are.

There is no central list of categories. Instead, categories are created as needed. Submitting a job with a given category will create that category if it does not already exist.

Job categories are implemented as subdirectories of the unclaimed jar.

Note: If you assign a job a category that no worker will check, that job will never be run.

Implicit Categories

In addition to whatever explicit categories a worker is assigned, each worker also checks a category with the same name as the node on which the worker is running. You can submit jobs to the implicit per-node category to ensure that they run on a specific node.

In general, broader categories are more useful than per-node categories. However, running a job on a specific node is the only way to accomplish certain tasks. For example, suppose that the node "arizona" is running three workers, and the job one of the workers is processing has hung. You can submit a job that kills the hung process to the "arizona" category, so that it runs on the node where the kill command can take effect.
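
For example, assuming the hung job's process ID on arizona is 12345 (a hypothetical value), you could type:

$JJ_HOME/bin/submit_job -c arizona /bin/kill -9 12345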

Using The submit_job Program

The program bin/submit_job in a Job Jar installation allows you to submit jobs to the system from the Unix command line.

The submit_job program can automate creation of a shell script for simple tasks. This allows you to submit jobs consisting of a Unix program plus some command parameters without manually creating a new script just to run the job. Type:

$JJ_HOME/bin/submit_job /usr/local/bin/big_computation /data/infile.tmp 104.1 30

This will first create an executable file with a unique name, containing the lines:

#!/bin/sh -f
/usr/local/bin/big_computation /data/infile.tmp 104.1 30

It will then copy the new file to limbo. When the copy is complete, submit_job will move the new file into the unclaimed jar.

The submit_job program can also copy executable files directly with the -f (file) flag. Type:

$JJ_HOME/bin/submit_job -f /opt/local/bin/standalone_job

This will:

  1. Copy the file specified after -f to limbo under a new, unique, name
  2. Move the job from limbo to the unclaimed jar

No intermediate shell script is created.

The submit_job program accepts the following options in addition to -f:

-p (priority)

The submit_job program submits jobs with a priority of 0 by default. You can specify a numeric priority with the -p (priority) flag. To submit a job with a priority of 10, type:

$JJ_HOME/bin/submit_job -p 10 /usr/local/bin/big_computation /data/infile.tmp 104.1 30

or:

$JJ_HOME/bin/submit_job -p 10 -f /opt/local/bin/standalone_job

-c (category)

The submit_job program submits jobs as uncategorized by default. You can specify a category with the -c (category) flag. To submit a job in category LINUX, type:

$JJ_HOME/bin/submit_job -c LINUX /usr/local/bin/big_computation /data/infile.tmp 104.1 30

-i (identifier)

The submit_job program copies jobs to a unique file name before submitting them to the system. This unique name includes information such as the date and time the job was submitted, and the login of the submitter. This can help you track the job file as it moves through the Job Jar system.

You can specify an additional string identifier to be put in the job's name with the -i (identifier) flag. Specifying an additional identifier can help you keep track of the purpose of a particular job. Type:

$JJ_HOME/bin/submit_job -i test_run /usr/local/bin/big_computation /data/infile.tmp 104.1 30

NOTE: Unix shells split command-line arguments into separate words based on whitespace. The submit_job program joins these words back together again with single spaces. Unix shells also interpret quotation marks specially. If you require quotation marks to be preserved in the final job that is submitted, you may need to protect them from the Unix shell with backslashes or by enclosing the entire command line in single quotes. You can also create a script file with quotes as you want them, and use the -f option to submit that file directly to the Job Jar system.
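
For example, to preserve the quotation marks around a two-word argument (the program and argument here are hypothetical), type:

$JJ_HOME/bin/submit_job /usr/local/bin/report_job \"weekly run\"

The generated job script will then contain the line /usr/local/bin/report_job "weekly run", so that report_job receives a single two-word argument.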

Submitting a Job From The Python Client Library

The Python client library allows you to submit jobs directly from Python, by calling the following functions:

  • submit_job(commands, priority=0, category=None, id=None): Submits the given string to the Job Jar system under a unique name. Equivalent to using the submit_job program. The priority, category, and id arguments are optional.
  • submit_job_file(filename, priority=0, category=None, id=None): Submits the given filename (which should be an absolute path to a Unix executable) to the Job Jar system under a unique name. Equivalent to using the submit_job program with the -f flag. The priority, category, and id arguments are optional.

The following example demonstrates the use of the Python client library to submit three jobs to the unclaimed jar. It assumes that the path to the Python client library is in your Python search path:

import jobjar
jobjar.submit_job('/usr/local/bin/big_computation /data/infile.tmp 104.1 30')
# Submit this job at a low priority, with an identifier
jobjar.submit_job_file('/opt/local/bin/standalone_job', -3, id='test_run')
# Submit this job with category LINUX
jobjar.submit_job_file('/opt/local/bin/standalone_job', category='LINUX')

Submitting a Job Manually

To submit a job to a Job Jar system, simply place an executable file in the unclaimed jar. You should copy the job to limbo first, then move it from limbo to the unclaimed jar, so that a worker does not claim a half-copied job. If the job you want to submit is a program named /opt/local/bin/standalone_job, and you wish the job to be uncategorized, type:

cd $JJ_HOME/limbo
cp /opt/local/bin/standalone_job .
mv standalone_job ../unclaimed

This is similar to typing:

$JJ_HOME/bin/submit_job -f /opt/local/bin/standalone_job

If you wish to submit the same job to the LINUX category, type:

cd $JJ_HOME/limbo
cp /opt/local/bin/standalone_job .
# Assumes the LINUX directory already exists
mv standalone_job ../unclaimed/LINUX

This is similar to typing:

$JJ_HOME/bin/submit_job -f /opt/local/bin/standalone_job -c LINUX

The only difference between submitting a job manually and using the submit_job program is that the submit_job program would have copied the job executable to a new, unique name in limbo, and would have created the LINUX category if it did not exist.

If you wish to set the priority of the job to something other than the default of 0, put the priority you wish in the filename's extension. In the previous example, you might specify a priority of 15 as follows:

cd $JJ_HOME/limbo
cp /opt/local/bin/standalone_job ./standalone_job.15
mv standalone_job.15 ../unclaimed

This is similar to typing:

$JJ_HOME/bin/submit_job -p 15 -f /opt/local/bin/standalone_job

Using The unique_job_name Program

The program bin/unique_job_name in a Job Jar installation generates a unique name for a job and prints the unique name to standard output. The job name is an absolute path located in the limbo directory. The first example in the previous section could be rewritten as:

jobname=`$JJ_HOME/bin/unique_job_name`
cp /opt/local/bin/standalone_job $jobname
mv $jobname $JJ_HOME/unclaimed
unset jobname

This is equivalent to typing:

$JJ_HOME/bin/submit_job -f /opt/local/bin/standalone_job

You can provide the -p option to unique_job_name to specify priority, just as with the submit_job program. The second example in the previous section could be rewritten as:

jobname=`$JJ_HOME/bin/unique_job_name -p 15`
cp /opt/local/bin/standalone_job $jobname
mv $jobname $JJ_HOME/unclaimed
unset jobname

This is equivalent to typing:

$JJ_HOME/bin/submit_job -p 15 -f /opt/local/bin/standalone_job

You can provide the -i option to unique_job_name to specify a string identifier to include in the job's name, just as with the submit_job program.

The intended purpose for the unique_job_name program is to allow you to write your own version of submit_job, which does any extra work needed in your problem domain. For example, if all of your jobs need a common preamble and postamble, you can write a program that automatically generates the preamble and postamble, so that you only need to specify the portion of each job that changes from job to job.
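
The following sketch shows a hypothetical my_submit script built on unique_job_name. It wraps the command given on its command line with a simple preamble and postamble, marks the result executable, and submits it:

#!/bin/sh
# my_submit (hypothetical): wrap a command with a common preamble and
# postamble, then submit the result as a job.
jobname=`$JJ_HOME/bin/unique_job_name -i wrapped`
{
  echo '#!/bin/sh -f'
  echo 'echo "preamble: starting in $JJ_SCRATCH_DIR"'
  echo "$@"
  echo 'echo "postamble: finished"'
} > $jobname
chmod +x $jobname
mv $jobname $JJ_HOME/unclaimed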

Monitoring a Running Job Jar System

Using The system_status Program

The program bin/system_status can be used to print a summary of the Job Jar system. Type:

cd $JJ_HOME/bin
./system_status

Monitoring a System From The Python Client Library

The Python client library contains functions for inspecting the status of the Job Jar system.

  • system_status(): Return a string summary of the Job Jar system.
  • system_status_info(): Return a dictionary containing detailed status information about the Job Jar system.

Monitoring a System Manually

When a worker starts, it creates a subdirectory in the workers directory of the Job Jar installation. This directory's name is of the form <node>.<pid> . For example, a worker running on node rain with process ID 7723 would create the directory $JJ_HOME/workers/rain.7723 . Thus, simply inspecting the contents of the workers subdirectory of a Job Jar installation can give you an overview of how many workers there are in the system, and on what nodes they're running.

Each worker's directory is further populated as follows:

  • logs: A subdirectory containing the standard output and standard error of every job the worker has run so far (including any job currently running). These files are named after the date and time the job was claimed.
  • history: A file containing messages from the worker itself, such as what time the worker started, start and end notifications for each job it claims, and other accounting information.
  • scratch: A directory for use by the currently-running job. This directory is re-created afresh before each job, and deleted after the job runs. If no job is running, this directory will not exist.
  • <job file>: The actual job file being run. When a worker claims a job, it moves the job file to the worker directory. The job name will differ from job to job.

Workers do not delete their directories when they exit.
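
For example, assuming a worker directory named rain.7723 exists (a hypothetical name), you can inspect the system by hand:

# One directory per worker, named <node>.<pid>
ls $JJ_HOME/workers
# Recent activity for one worker
tail $JJ_HOME/workers/rain.7723/history
# Standard output and error from that worker's jobs
ls $JJ_HOME/workers/rain.7723/logs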

Configuration

A Job Jar installation needs no special configuration to run. However, there are several aspects of a Job Jar system that can be configured if desired.

When Job Jar programs run, they determine their configuration parameters in this order:

  1. Default values of parameters
  2. Parameters set in configuration files on disk
  3. Parameters set in environment variables (which may point to alternate configuration files)
  4. Parameters set in the Python client library (only applicable for Python programs)

That is, parameters set in environment variables override parameters set in configuration files, and so on.

Common Configuration Parameters

The following parameters may be set in configuration files, environment variables, or in the Python client library.

JJ_PAUSE_TIME. If set, this parameter's value is taken as the base time, in seconds, for a worker to sleep when it is paused or when no jobs are available. A randomizing factor of plus or minus 10%, but no more than 10 seconds, is applied to this time to help spread out access to the shared directories from workers on different nodes. The default is 60 seconds (one minute).

JJ_MAX_WORKERS. If set, this parameter's value is taken as the maximum number of workers that should run on the current node. The program check_workers consults this parameter. This parameter may only be set in an environment variable or INI file, not from the Python client library. The default is 4 workers per node.

JJ_HOME. If set, this parameter's value is taken to be an absolute path. It is interpreted as the parent of job directories (unclaimed, claimed, and completed), the limbo directory, and worker directories (<node>.<pid>). The default location for job and worker directories is the top directory of the Job Jar installation hierarchy.

JJ_WORKER_HOME. If set, this parameter's value is taken to be an absolute path. It is interpreted as the parent of worker directories (<node>.<pid>). If both JJ_HOME and JJ_WORKER_HOME are set, then job directories are under the value of JJ_HOME, while worker directories are under the value of JJ_WORKER_HOME. The default location for worker directories is $JJ_HOME/workers.

JJ_CATEGORIES. If set, this parameter's value is taken to be an ordered list of the categories workers check for jobs. Workers check these categories after checking the implicit category for the node on which the worker is running. Workers always check for uncategorized jobs as well, but only after first checking for jobs in the implicit per-node category, and categories specified by JJ_CATEGORIES.

Configuring a System With INI Files

Job Jar system programs look for configuration parameters in a file named .jobjar/config.ini in your home directory. This file is called the personal configuration file. If the environment variable JJ_CONFIG_FILE is set, then its value is taken as an absolute path to a configuration file to use instead.

If the file config.ini exists in the Job Jar installation directory, configuration parameters are first read from that file. This file is called the global configuration file. Settings in the personal configuration file will override settings in the global configuration file, and settings in the environment will override both. The Job Jar distribution includes a global configuration file. You may copy this file to the personal configuration file to edit it, or remove it entirely, if you wish.

The structure of Job Jar configuration files is similar to that of Windows INI files. Lines beginning with "#" or ";" are ignored and may be used to provide comments. The first nonblank, non-comment line of a config file must be [Job Jar]. Subsequent lines contain "name: value" entries ("name=value" is also accepted). The names you may set are those described in the section Common Configuration Parameters .

The following is an example INI file which sets all of the common configuration parameters:

# Sample Job Jar configuration file
[Job Jar]
# Pause 5 minutes when no jobs are available
JJ_PAUSE_TIME: 300
# The check_workers program may start up to 4 workers on this node 
JJ_MAX_WORKERS: 4
# The parent directory for the unclaimed, claimed, completed, and
# limbo directories
JJ_HOME: /opt/jobjar-1.0.0
# The parent directory for worker-specific directories
JJ_WORKER_HOME: /opt/jobjar_workers
# The categories workers will check for jobs
JJ_CATEGORIES: LINUX

Configuring a System With Environment Variables

You may set environment variables to control a Job Jar system. Environment variables override settings in configuration files. All of the parameters in the section Common Configuration Parameters may be set in the environment. There is also one additional parameter that you may set: JJ_CONFIG_FILE.

JJ_CONFIG_FILE. If set, this variable's value is taken as the absolute path to the personal configuration file. See the section Configuring a System With INI Files for details. Any values set in the environment will override values set in this configuration file. For example, if the configuration file specifies JJ_HOME, and JJ_HOME is also set in the environment, the value from the environment is used.
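
For example, the following Bourne shell commands configure and start a worker entirely through the environment, using the same illustrative values as the sample INI file above:

JJ_HOME=/opt/jobjar-1.0.0
JJ_WORKER_HOME=/opt/jobjar_workers
JJ_CATEGORIES=LINUX
export JJ_HOME JJ_WORKER_HOME JJ_CATEGORIES
$JJ_HOME/bin/start_worker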

Configuring The Python Client Library

When the Python client library interacts with a Job Jar system, it consults any INI files and environment variables for configuration. You may also set and inspect configuration parameters directly from Python. If configuration parameters are set in INI files or environment variables, and also set explicitly using the Python client library, the settings in the Python client library will override the settings in INI files and environment variables.

The configuration-related functions in the Python client library are:

  • set_home(path): Set the JJ_HOME configuration parameter to the given absolute path.
  • get_home(): Return the current effective value of the JJ_HOME configuration parameter.
  • set_worker_home(path): Set the JJ_WORKER_HOME configuration parameter to the given absolute path.
  • get_worker_home(): Return the current effective value of the JJ_WORKER_HOME configuration parameter.
  • set_pause_time(seconds): Set the JJ_PAUSE_TIME configuration parameter.
  • get_pause_time(): Get the current effective value of the JJ_PAUSE_TIME configuration parameter.
  • set_categories(category1, ...): Set the JJ_CATEGORIES configuration parameter.
  • get_categories(): Get the current effective value of the JJ_CATEGORIES configuration parameter, as a single string listing the categories in order.

The following example demonstrates the use of the Python client library to configure the Job Jar system. It assumes that the path to the Python client library is in your Python search path:

import jobjar
jobjar.set_home('/opt/jobjar-1.0.0')
jobjar.set_worker_home('/opt/jobjar_workers')
jobjar.set_pause_time(500)

The Worker Lifecycle

It is useful to understand what each worker process does. The pseudocode below describes the lifecycle of a Job Jar worker:

read configuration parameters from INI files and environment variables
create worker directory
do forever:
  if stopped:
    exit
  if paused:
    sleep for JJ_PAUSE_TIME seconds, plus a small randomizer
    continue

  job_to_claim = next_job_to_claim()

  if no job available:
    sleep for JJ_PAUSE_TIME seconds, plus a small randomizer
    continue

  move job_to_claim to worker directory

  if move failed:
    # Another worker claimed the job first
    continue
  make symbolic link in claimed jar to newly-claimed job   
  make job scratch directory
  set environment variables for job
  run job
  move job to completed jar
  remove symbolic link from claimed jar
  remove job scratch directory


function next_job_to_claim ():
  for category in [per-node category] + JJ_CATEGORIES:
    candidates = list of all jobs in category with the highest priority
    if no candidates:
      continue
    else:
      return oldest job in candidates
  
  candidates = list of all uncategorized jobs with the highest priority
  return oldest job in candidates

The primary points to notice in the worker lifecycle are:

  • Claiming a job is a single atomic move, so even if several workers attempt to claim the same job, only one will succeed
  • STOP and PAUSE conditions are checked only between jobs; a running job is never interrupted
  • Categories are checked in order, beginning with the implicit per-node category; uncategorized jobs are claimed only when every category the worker checks is empty
  • Within a category, higher-priority jobs are claimed first, and the older job wins ties


$Id: user_guide.txt,v 1.11 2004/07/13 01:30:04 sfiedler Exp $