DoC Computing Support Group


Revision 1 as of 2008-07-23 16:33:00

Condor in the Department of Computing

What is Condor?

Condor is an open-source, high-throughput batch-processing system developed by the University of Wisconsin. It is currently deployed in a "cycle-stealing" mode on desktop Linux machines throughout the department, and in a "dedicated" mode on some cluster nodes. Condor is useful for running large numbers of similar jobs in a reliable and parallel fashion.

Utilization statistics for the main pool are available.

Who can use Condor?

Anyone with a CSG Linux account in the department, staff or student, may submit jobs to the Condor pool; no special permission need be obtained.

That said, it should only be used for academic work; if you would like to use the departmental machines for non-academic work (with Condor or otherwise) you should ask CSG for permission first.

How do I use Condor?

Condor can be driven using the command-line tools available on CSG Linux machines. The commands you will likely use most often are:

  • condor_submit, which allows you to submit jobs to the system;

  • condor_q, which allows you to query the state of jobs in the pool; and

  • condor_status, which allows you to query the state of machines in the pool.

You will need to add ${CONDOR_HOME}/bin to your PATH before using the Condor command-line tools, e.g. by running the following in csh or tcsh:

setenv PATH ${PATH}:${CONDOR_HOME}/bin
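The setenv line above is csh/tcsh syntax and will not work in Bourne-style shells such as sh or bash. A minimal sketch of an equivalent for those shells follows; the function name add_condor_to_path is made up for illustration, and CONDOR_HOME is assumed to be set already by your login environment.

```shell
# Append Condor's bin directory to PATH, unless it is already there.
# Assumes CONDOR_HOME has been set by the login environment.
add_condor_to_path() {
    case ":${PATH}:" in
        *":${CONDOR_HOME}/bin:"*)
            ;;  # already on the PATH; nothing to do
        *)
            PATH="${PATH}:${CONDOR_HOME}/bin"
            export PATH
            ;;
    esac
}
```

Calling add_condor_to_path from a shell start-up file such as ~/.profile should make the tools available in every new session; because of the check, calling it more than once does not add duplicate entries.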

Manual pages for all of the Condor commands are available; access them using man followed by the command name (for example, man condor_submit).

Job submission to Condor requires that you construct a job specification to define which program you want to run and how you would like the Condor system to run it. Job specifications in Condor are very powerful and there are a large number of different options that you can set. To help get you started, here is a simple example:

#############################
##
## Example Condor job specification
##
##############################

# This is a comment.

# This defines what job universe we want the job to run in.
# 'vanilla' is the simplest option for basic command execution.
# Other universes that exist include 'standard', 'grid', 'java' and 'mpi'.
universe        = vanilla

# This defines the path of the executable we want to run.
executable      = /bin/uname

# This specifies where data sent to STDOUT by the executable should be
# directed to.
#
# The Condor system can perform variable substitution in job specifications;
# the $(Process) string below will be replaced with the job's Process value.
# If we submit multiple jobs from this single specification (we do, as you 
# will see later) then the Process value will be incremented for each job.
# For example, if we submit 100 jobs, then each job will have a different 
# Process value, numbered in sequence from 0 to 99.
#
# If we were to instruct every job to redirect STDOUT to the same file, then
# data would be lost as each job would overwrite the same file in an 
# uncontrolled manner.  Thus, we direct STDOUT for each job to a uniquely 
# named file.
output          = uname.$(Process).out

# As above, but for STDERR.
error           = uname.$(Process).err

# Condor can write a time-ordered log of events related to this job-set
# to a file we specify.  This specifies where that file should be written.
log             = uname.log

# This specifies what commandline arguments should be passed to the executable.
arguments       = -n -m -p

# This specifies that the specification, as parsed up to this point, should be 
# submitted 5 times.  (If the number is omitted, the number '1' is assumed.)
queue 5

To run this job specification, save it to a file (say uname-5.cmd) and then pass it to the condor_submit command, i.e.:

[user@machine:~]% condor_submit uname-5.cmd
Submitting job(s).....
Logging submit event(s).....
5 job(s) submitted to cluster 1.

Once submitted, you can inspect the current status of your jobs by running condor_q:

[user@machine:~]% condor_q

-- Submitter: machine.doc.ic.ac.uk : <146.169.0.0:12345> : machine.doc.ic.ac.uk
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   user            2/1  11:52   0+00:00:00 I  0   0.0  uname -n -m -p
   1.1   user            2/1  11:52   0+00:00:00 I  0   0.0  uname -n -m -p
   1.2   user            2/1  11:52   0+00:00:00 I  0   0.0  uname -n -m -p
   1.3   user            2/1  11:52   0+00:00:00 I  0   0.0  uname -n -m -p
   1.4   user            2/1  11:52   0+00:00:00 I  0   0.0  uname -n -m -p

5 jobs; 5 idle, 0 running, 0 held
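When a cluster contains many jobs, it can help to pick out only those in a particular state. The following is a rough sketch that filters condor_q output on the ST column; the function name jobs_in_state is hypothetical, and the field positions assume exactly the column layout shown above (ID, OWNER, a two-part SUBMITTED field, RUN_TIME, ST).

```shell
# Print the IDs of jobs whose ST column equals the given state letter
# (I = idle, R = running, H = held).  Reads condor_q output on stdin.
# Only lines whose first field looks like "cluster.process" are examined,
# so the header and summary lines are skipped.
jobs_in_state() {
    awk -v st="$1" '$1 ~ /^[0-9]+\.[0-9]+$/ && $6 == st { print $1 }'
}
```

For example, condor_q | jobs_in_state I would list the idle jobs, one ID per line.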

When your jobs terminate, the Condor system will email you automatically. Once all of your jobs have completed, you should find that each job's STDOUT and STDERR have been written to the files named in your specification:

[user@machine:~]% ls
uname.0.err  uname.1.err  uname.2.err  uname.3.err  uname.4.err  uname-5.cmd
uname.0.out  uname.1.out  uname.2.out  uname.3.out  uname.4.out  uname.log
[user@machine:~]% cat *.out
squidward i686 Intel(R) Pentium(R) 4 CPU 3.00GHz
squidward i686 Intel(R) Pentium(R) 4 CPU 3.00GHz
lexington i686 unknown
lexington i686 unknown
ming i686 unknown

(The strings returned by uname -n -m -p will naturally vary depending on which machines your jobs actually ran on.)
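The $(Process) substitution described in the example specification is not limited to file names; it can also be used in the arguments line, which is a common way to give each job in a set a different piece of work. The sketch below writes a small specification to a file and submits it only if condor_submit is on your PATH. The file name sweep.cmd and the use of /bin/echo are illustrative choices, not part of the departmental setup.

```shell
# Write a job specification in which every job receives its own
# Process number as its only command-line argument.
# The quoted 'EOF' stops the shell from expanding $(Process) itself.
cat > sweep.cmd <<'EOF'
universe   = vanilla
executable = /bin/echo
arguments  = $(Process)
output     = sweep.$(Process).out
error      = sweep.$(Process).err
log        = sweep.log
queue 3
EOF

# Submit the specification only when the Condor tools are available.
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit sweep.cmd
fi
```

Once the three jobs have finished, each sweep.N.out file should hold that job's own Process number.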

This is just a basic overview; the full canonical manual on the use (and administration) of Condor is available from the main Condor website.

Troubleshooting

Problem: I can't submit my job -- I get one of the following errors:

[username@host:~] condor_submit my_job.cmd
Submitting job(s)
ERROR: Failed to set Owner="username" for job 3.0

ERROR: Failed to queue job.

[username@host:~] condor_submit my_job.cmd
Submitting job(s)
ERROR: Failed to connect to local queue manager
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using KERBEROS

Explanation: The current Condor installation uses Kerberos to authenticate almost all Condor-related operations, including job submission. Either of the above error messages indicates that you don't have a valid Kerberos ticket and need to obtain a new one.

Solution: Simply use kinit to obtain a new Kerberos ticket and try again.
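If you submit jobs from scripts, the ticket check can be folded into a small wrapper. The following is a sketch only: the function name submit_with_ticket is hypothetical, and it assumes the MIT Kerberos klist, whose -s option prints nothing and reports through its exit status whether a valid ticket is present.

```shell
# Submit a job specification, obtaining a Kerberos ticket first if needed.
# $1 = path to the job specification file.
submit_with_ticket() {
    # klist -s: silent check; non-zero exit status means no valid ticket
    # (MIT Kerberos behaviour assumed).
    if ! klist -s; then
        # kinit prompts for your password; give up if it fails.
        kinit || return 1
    fi
    condor_submit "$1"
}
```

Usage would then be submit_with_ticket my_job.cmd in place of calling condor_submit directly.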

Problem: My Standard universe Condor jobs keep crashing with SIGSEGV (Signal 11)!

Explanation: Segmentation faults are caused by a process attempting to access a memory address that it doesn't have rights to access. This is a fatal error, and will immediately terminate that process.

Standard universe jobs are jobs which have been compiled with the condor_compile facility so that they will checkpoint their memory contents to stable storage when signalled. This will occur, for example, when a machine on which the job has been running is reclaimed by its owner - the job will be checkpointed and then terminated.

The checkpointed job will then be restarted on a (potentially completely different) machine when one becomes available - its memory state initialised with the contents of its last checkpoint image. However, if the new machine that the job is now running on is not running a compatible operating system image, then your application may try to access memory that it doesn't own, resulting in a segmentation fault.

The CSG Mandrake 9.1 and 10.2 and Ubuntu 7.04 software distributions are all sufficiently different to fall into this category; as a result, you need to ensure that your code will only run on one specific standard CSG distribution.

Solution: Two extra ClassAd attributes, named DoC_OS_Distribution and DoC_OS_Release, have been added to every machine in the DoC Condor pool. Their values are populated from the output of lsb_release. For example, CSG Mandrake 10.2 machines advertise the following values:

DoC_OS_Distribution = "Mandrakelinux"
DoC_OS_Release = "10.2"

Similarly, CSG Ubuntu 7.04 machines currently advertise the following values:

DoC_OS_Distribution = "Ubuntu"
DoC_OS_Release = "7.04"

Some CSG machines are now running 64-bit operating systems; these can be identified by the Arch field populated by Condor. 32-bit machines will advertise themselves as the INTEL architecture (even if the machine is running on non-Intel chips!) while 64-bit machines will advertise themselves as X86_64.

You can force your job to only run on machines that define these fields in a particular way using an expression of the following form in your job description file:

## Example Condor command file

Universe        = standard
Executable      = foobar
Output          = foobar.out
Error           = foobar.err
Log             = foobar.log

Requirements    = DoC_OS_Distribution == "Ubuntu" && \
                  DoC_OS_Release == "7.04" && \
                  Arch == "INTEL"
        
Queue 1

Many other attributes are defined and populated for every execution host; try running condor_status $hostname -l to look at the current list for a host.
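Before submitting, it can be useful to check how many machines would satisfy a Requirements expression like the one above. One way to do this, sketched here under the assumption that the pool advertises the attributes described in this section, is to pass the same expression to the -constraint option of condor_status. The helper name doc_constraint is made up for illustration.

```shell
# Build a constraint expression from a distribution name, a release
# number and an Arch value, in the same form as the Requirements
# line of the example command file.
doc_constraint() {
    printf 'DoC_OS_Distribution == "%s" && DoC_OS_Release == "%s" && Arch == "%s"' \
        "$1" "$2" "$3"
}

# List only the matching machines, if the Condor tools are available.
if command -v condor_status >/dev/null 2>&1; then
    condor_status -constraint "$(doc_constraint Ubuntu 7.04 INTEL)"
fi
```

If the listing comes back empty, a job carrying the equivalent Requirements expression is unlikely to find a machine to run on and will probably stay idle.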

Mailing lists

There are two mailing lists relating to the use of Condor in the Department:

  • Condor-announce: This low-volume read-only list is used to distribute service announcements regarding the Departmental Condor facility. Advance warning regarding service downtime, software upgrades, and other factors impacting service availability will be distributed here.

  • Condor-discuss: This is a general discussion list for users (or prospective users) of the Departmental Condor facility.

Any regular user of the Condor system is advised to subscribe to both of these lists.

Help!

If you are having problems with Condor, please contact CSG as usual via:

... or one of the other standard methods.

If you need to contact the Condor maintainer directly for some other reason, you can email: