DoC Computing Support Group



What is Condor?

Condor is an open-source, high-throughput batch-processing system developed by the University of Wisconsin. It is currently deployed in a "cycle-stealing" mode on desktop Linux machines throughout the department, and in a "dedicated" mode on some cluster nodes.

As currently deployed, Condor is useful for running large numbers of independent, CPU-intensive jobs in an automated, reliable fashion.

Utilization statistics for the main pool are available.

Who can use Condor?

Anyone with a CSG Linux account in the department, staff or student, may submit jobs to the Condor pool; no special permission need be obtained.

That said, it should only be used for academic work; if you would like to use the departmental machines for non-academic work (with Condor or otherwise) you should ask CSG for permission first.

How do I use Condor?

Condor can be driven using the command-line tools available on CSG Linux machines. The commands you will likely use most often are:

  • condor_submit, which allows you to submit jobs to the system;

  • condor_q, which allows you to query the state of jobs in the pool; and

  • condor_status, which allows you to query the state of machines in the pool.

You will need to add ${CONDOR_HOME}/bin to your PATH before using the Condor command-line tools, e.g. by running:

# If your shell is bash
export PATH=${PATH}:${CONDOR_HOME}/bin

# If your shell is (t)csh
setenv PATH ${PATH}:${CONDOR_HOME}/bin
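
If you would rather not repeat this by hand in every session, the bash line above can be wrapped in a small function in your shell start-up file. This is only a sketch: the function name add_condor_to_path is made up for illustration, and it assumes CONDOR_HOME is already set in your environment.

```shell
# Hypothetical helper (not part of Condor): append ${CONDOR_HOME}/bin to PATH
# exactly once, so the file can be sourced repeatedly without growing PATH.
add_condor_to_path() {
    # Assumes CONDOR_HOME has been set by the local environment.
    if [ -z "${CONDOR_HOME:-}" ]; then
        echo "CONDOR_HOME is not set; cannot locate the Condor tools" >&2
        return 1
    fi
    case ":${PATH}:" in
        *":${CONDOR_HOME}/bin:"*)
            ;;  # already on PATH, nothing to do
        *)
            PATH="${PATH}:${CONDOR_HOME}/bin"
            export PATH
            ;;
    esac
}
```

Call add_condor_to_path once after defining it; command -v condor_submit should then print the location of the tool.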

Manual pages for all of the Condor commands are available; access them using man followed by the command name, e.g. man condor_submit.

Job submission to Condor requires that you construct a job specification to define which program you want to run and how you would like the Condor system to run it. Job specifications in Condor are very powerful and there are a large number of different options that you can set. To help get you started, here is a simple example:

#############################
##
## Example Condor job specification
##
##############################

# This is a comment.

# This defines what job universe we want the job to run in.
# 'vanilla' is the simplest option for basic command execution.
# Other universes that exist include 'standard', 'grid', 'java' and 'mpi'.
universe        = vanilla

# This defines the path of the executable we want to run.
executable      = /bin/uname

# This specifies where data sent to STDOUT by the executable should be
# directed to.
#
# The Condor system can perform variable substitution in job specifications;
# the $(Process) string below will be replaced with the job's Process value.
# If we submit multiple jobs from this single specification (we do, as you 
# will see later) then the Process value will be incremented for each job.
# For example, if we submit 100 jobs, then each job will have a different 
# Process value, numbered in sequence from 0 to 99.
#
# If we were to instruct every job to redirect STDOUT to the same file, then
# data would be lost as each job would overwrite the same file in an 
# uncontrolled manner.  Thus, we direct STDOUT for each job to a uniquely 
# named file.
output          = uname.$(Process).out

# As above, but for STDERR.
error           = uname.$(Process).err

# Condor can write a time-ordered log of events related to this job-set
# to a file we specify.  This specifies where that file should be written.
log             = uname.log

# This specifies what commandline arguments should be passed to the executable.
arguments       = -n -m -p

# This specifies that the specification, as parsed up to this point, should be 
# submitted 5 times.  (If the number is omitted, the number '1' is assumed.)
queue 5

To run this job specification, save it to a file (say uname-5.cmd) and then pass it to the condor_submit command, i.e.:

[user@machine:~]% condor_submit uname-5.cmd
Submitting job(s).....
Logging submit event(s).....
5 job(s) submitted to cluster 1.

Once submitted, you can inspect the current status of your jobs by running condor_q:

[user@machine:~]% condor_q

-- Submitter: machine.doc.ic.ac.uk : <146.169.0.0:12345> : machine.doc.ic.ac.uk
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   user            2/1  11:52   0+00:00:00 I  0   0.0  uname -n -m -p
   1.1   user            2/1  11:52   0+00:00:00 I  0   0.0  uname -n -m -p
   1.2   user            2/1  11:52   0+00:00:00 I  0   0.0  uname -n -m -p
   1.3   user            2/1  11:52   0+00:00:00 I  0   0.0  uname -n -m -p
   1.4   user            2/1  11:52   0+00:00:00 I  0   0.0  uname -n -m -p

5 jobs; 5 idle, 0 running, 0 held
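
Instead of re-running condor_q until the queue drains, you can block until the jobs have finished. A minimal sketch, assuming the condor_wait tool is installed alongside the other Condor commands and that uname.log is the log file named in the specification above; wait_for_jobs is a hypothetical wrapper name:

```shell
# Hypothetical wrapper: block until every job recorded in a Condor user log
# has finished, or until the optional timeout (in seconds) expires.
wait_for_jobs() {
    log_file=$1
    timeout=${2:-}
    if ! command -v condor_wait >/dev/null 2>&1; then
        echo "condor_wait not found; is \${CONDOR_HOME}/bin on your PATH?" >&2
        return 127
    fi
    if [ -n "$timeout" ]; then
        # -wait limits how long condor_wait blocks before giving up.
        condor_wait -wait "$timeout" "$log_file"
    else
        condor_wait "$log_file"
    fi
}
```

For example, wait_for_jobs uname.log 600 would wait at most ten minutes for the five uname jobs.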

When your jobs terminate, the Condor system will email you automatically. Once all of your jobs have completed, you should find that each job's STDOUT and STDERR have been written to the files named in your specification:

[user@machine:~]% ls
uname.0.err  uname.1.err  uname.2.err  uname.3.err  uname.4.err  uname-5.cmd
uname.0.out  uname.1.out  uname.2.out  uname.3.out  uname.4.out  uname.log
[user@machine:~]% cat *.out
squidward i686 Intel(R) Pentium(R) 4 CPU 3.00GHz
squidward i686 Intel(R) Pentium(R) 4 CPU 3.00GHz
lexington i686 unknown
lexington i686 unknown
ming i686 unknown

(The strings returned by uname -n -m -p will naturally vary depending on which machines your jobs actually ran on.)
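
The $(Process) macro is not limited to file names; it can also appear in the arguments line, which is a common way to give each job a different input. The sketch below writes such a specification from the shell. The file name echo-3.cmd and the use of /bin/echo are illustrative choices, not part of the local setup:

```shell
# Write a job specification in which each job receives its own Process
# number as a command-line argument.  The heredoc delimiter is quoted so the
# shell leaves $(Process) alone; Condor substitutes it at submit time.
cat > echo-3.cmd <<'EOF'
universe        = vanilla
executable      = /bin/echo
arguments       = job number $(Process)
output          = echo.$(Process).out
error           = echo.$(Process).err
log             = echo.log
queue 3
EOF

# Submit only where the Condor tools are available.
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit echo-3.cmd
fi
```

Once the jobs have finished, echo.0.out, echo.1.out and echo.2.out should each hold the line for their own job number.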

This is just a basic overview; the full canonical manual on the use (and administration) of Condor is available from the main Condor website.

Locally-defined ClassAds (DEPRECATED)

Note: As of February 2013, the locally-defined ClassAds have been removed. Do not use them.

The following extra, non-standard ClassAds are populated on execution hosts in the DoC Condor pool, and can be identified by a DoC_ prefix:

Each entry below gives the ClassAd name(s), the definition, and an example value.

  • DoC_CPU#_ModelString, DoC_CPU#_Mhz, DoC_CPU#_CacheSizeString, DoC_CPU#_FlagsString

    Definition: These show the Model, Mhz, CacheSize and Flags fields for CPU number #, as reported by /proc/cpuinfo.

    Example values:
      DoC_CPU#_ModelString: Intel(R) Pentium(R) 4 CPU 3.00GHz
      DoC_CPU#_Mhz: 2992.890
      DoC_CPU#_CacheSizeString: 512 KB
      DoC_CPU#_FlagsString: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr

  • DoC_Host_Classes

    Definition: This shows a comma-separated list of the CSG-internal host-classes that this execution host belongs to, in non-ascending order of significance.

    Example value: OSCLIENTLINUXONLY, HOSTDOC, LINUXLAB, CLIENT, LAB, VERTEX, vertex02

  • DoC_OS_Distribution, DoC_OS_Release

    Definition: These show the distributor ID and release number for the local operating system, as reported by the utility lsb_release -d -r.

    Example values:
      DoC_OS_Distribution: Ubuntu
      DoC_OS_Release: 7.04

  • DoC_Package_Version_name

    Definition: For a small number of selected packages (for example, gcc-4.1), these show the version number for that package. Note that any hyphens in a package name will be converted to underscores to meet ClassAd naming conventions. If a given package is not installed, a ClassAd for that package will not appear.

    Example value: 4.1.2-0ubuntu4

  • DoC_Priority_Users

    Definition: Contains a space-separated list of users whose jobs will receive priority access to this host.

Many other attributes are defined and populated for every execution host; try running condor_status $hostname -l to look at the current list for a host. Documentation for these attributes can be found in the Condor Manual.

Frequently Answered Questions

Q. How do I limit the set of execution machines that my job will be dispatched to?

A. You can set a boolean Requirements ClassAd expression in your job specification file to limit where your job will be dispatched to. If the Requirements expression evaluates to FALSE for a given host, your job will not be sent to run there.

For example, you can use the following expression to specify that your job should only run on 2016 DoC Lab computers:

Requirements = regexp("^(line|matrix|point|ray|voxel)[0-9]{2}", TARGET.Machine)  
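
Before submitting, it can be worth checking that the pattern matches the host names you expect. The sketch below tests the same pattern locally with grep -E. The host names and the helper name matches_lab_host are made up for illustration, and it assumes the pattern behaves the same under grep's extended syntax as under Condor's regexp(), which should hold for a simple pattern like this one:

```shell
# The pattern from the Requirements expression above.
pattern='^(line|matrix|point|ray|voxel)[0-9]{2}'

# Hypothetical helper: succeed if a host name would satisfy the pattern.
matches_lab_host() {
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Made-up host names, for illustration only.
for host in voxel07.doc.ic.ac.uk ray12.doc.ic.ac.uk squidward.doc.ic.ac.uk; do
    if matches_lab_host "$host"; then
        echo "$host: eligible"
    else
        echo "$host: excluded"
    fi
done
```

On a machine with the Condor tools installed, passing the same regexp() call (with Machine in place of TARGET.Machine) to condor_status -constraint should list the pool machines that currently match.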

Troubleshooting

Problem: When I try to submit my job, I get this error:

[username@host:~] condor_submit my_job.cmd
Submitting job(s)
ERROR: Failed to connect to local queue manager
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using KERBEROS

Explanation: The current Condor installation uses Kerberos to authenticate almost all Condor-related operations, including job submission. The above error message indicates that you don't have a (valid) Kerberos ticket, and need to generate a new one.

Solution: Simply use kinit to generate a new Kerberos ticket and try again:

kinit username@IC.AC.UK
# Or if you have a legacy DoC Kerberos principal
kinit username@DOC.IC.AC.UK
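
If you submit from scripts, the ticket check can be done up front. A minimal sketch, assuming a klist that supports the -s option (run silently and report ticket validity through the exit status), as MIT Kerberos does; ensure_ticket is a hypothetical helper name:

```shell
# Hypothetical helper: request a new Kerberos ticket only when the current
# credential cache holds no valid one.
ensure_ticket() {
    principal=$1
    if klist -s; then
        return 0            # a valid ticket already exists
    fi
    kinit "$principal"      # prompts for your password
}
```

A typical call would be ensure_ticket username@IC.AC.UK && condor_submit my_job.cmd.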

Problem: My Standard universe Condor jobs keep crashing with a SIGSEGV (Signal 11)!

Explanation: Segmentation faults are caused by a process attempting to access a memory address that it doesn't have rights to access. This is a fatal error, and will immediately terminate that process.

Standard universe jobs are jobs which have been compiled with the condor_compile facility so that they will checkpoint their memory contents to stable storage when signaled. This will occur, for example, when a machine on which the job has been running is reclaimed by its owner: the job will be checkpointed and then terminated.

The checkpointed job will then be restarted on a (potentially completely different) machine when one becomes available, with its memory state initialised from its last checkpoint image. However, if the new machine is not running a compatible operating system image, then your application may try to access memory that it doesn't own, resulting in a segmentation fault.

The CSG Mandrake 9.1 and 10.2 and Ubuntu 7.04 software distributions are all sufficiently different to fall into this category; as a result, you need to ensure that your code will only run on one specific standard CSG distribution.

Solution: Extra ClassAds, named DoC_OS_Distribution and DoC_OS_Release, have been added to every machine in the DoC Condor pool. The contents of these fields are populated using the output from lsb_release. For example, CSG Mandrake 10.2 machines will advertise the following values:

DoC_OS_Distribution = "Mandrakelinux"
DoC_OS_Release = "10.2"

Similarly, CSG Ubuntu 7.04 machines currently advertise the following values:

DoC_OS_Distribution = "Ubuntu"
DoC_OS_Release = "7.04"

Some CSG machines are now running 64-bit operating systems; these can be identified by the Arch field populated by Condor. 32-bit machines will advertise themselves as the INTEL architecture (even if the machine is running on a non-Intel chipset!) while 64-bit machines will advertise themselves as X86_64.

Many other attributes are defined and populated for every execution host; try running condor_status $hostname -l to look at the current list for a host.
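
Putting this together, a Standard universe specification can pin its jobs to one distribution and architecture. The following is a sketch only: my_program is a hypothetical executable already relinked with condor_compile, the file name is arbitrary, and the expression can only match on hosts that actually advertise the DoC_OS_* attributes (check with condor_status $hostname -l first):

```shell
# Write a job specification restricted to one CSG distribution.  The heredoc
# delimiter is quoted so the shell does not touch $(Process).
cat > my_program-ubuntu.cmd <<'EOF'
universe        = standard
executable      = my_program
output          = my_program.$(Process).out
error           = my_program.$(Process).err
log             = my_program.log
# Only run (and restart from a checkpoint) on 32-bit CSG Ubuntu 7.04 hosts.
Requirements    = (DoC_OS_Distribution == "Ubuntu") && (DoC_OS_Release == "7.04") && (Arch == "INTEL")
queue
EOF
```

Swap in "Mandrakelinux" and "10.2", or X86_64 for the architecture, to target a different class of machine.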

Mailing lists

There are two mailing lists relating to the use of Condor in the Department:

  • Condor-announce: This low-volume read-only list is used to distribute service announcements regarding the Departmental Condor facility. Advance warning regarding service downtime, software upgrades, and other factors impacting service availability will be distributed here.

  • Condor-discuss: This is a general discussion list for users (or prospective users) of the Departmental Condor facility.

Any regular user of the Condor system is advised to subscribe to both of these lists.

Help!

If you are having problems with Condor, please contact CSG as usual via:

 
 

services/hpc/condor (last edited 2017-01-20 21:55:36 by ldk)