Revision as of 14:20, 22 April 2009

Condor: Making best use of the computers in the teaching labs

Introduction

The Condor Project enables us to run batch jobs on the pool of desktop computers around the department that would otherwise be standing idle. Condor is particularly useful for high-throughput computing, such as an ensemble of independent model simulations used to explore parameter space.

Basic commands

You can run the following commands from the submission host of your condor pool. For Geography, this is condor.ggy.bris.ac.uk (Note that this is a Linux server).

You can review the status of all the machines in the 'condor pool' using the command condor_status:

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@GEOG-B224.gg WINNT51    INTEL  Owner     Idle     0.000  1661  0+01:30:04
slot2@GEOG-B224.gg WINNT51    INTEL  Owner     Idle     0.010  1661  0+01:30:05
slot1@geog-a105.gg WINNT51    INTEL  Claimed   Busy     1.130  1661  0+01:02:57
slot2@geog-a105.gg WINNT51    INTEL  Claimed   Busy     1.130  1661  0+01:02:58
slot1@geog-c200.gg WINNT51    INTEL  Unclaimed Idle     0.000  1662  0+00:00:04
slot2@geog-c200.gg WINNT51    INTEL  Unclaimed Idle     0.040  1662  0+00:00:00
...
...
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

       INTEL/WINNT51   191    78     109         4       0          0        0

               Total   191    78     109         4       0          0        0

Typically you will get several screenfuls of output, so I've chopped out the middle part of the listing, leaving just a few lines from the top and the summary given at the end.

From the listing, you can see that:

  • The PC called GEOG-B224 has someone logged into it, indicated by the keyword Owner, but it is not working hard, as it is Idle.
  • In contrast, geog-a105 is marked as Claimed, indicating that it has been grabbed by condor, and it is working hard, as it is Busy.
  • The third possible state, Unclaimed, is exemplified by geog-c200, which is neither claimed by condor nor in interactive use.

The final summary tells us that, at the time of writing, the pool contains 191 PCs: 78 have a user logged in, 109 are claimed by condor and 4 remain unclaimed.
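
With several screenfuls of output, it can be handy to tally the lines by state yourself. The following is a minimal sketch, assuming the column layout shown above (State in the fourth column). The function name count_states is made up for illustration, and the sample lines are copied from the listing rather than read from a live pool.

```shell
# Tally condor_status lines by the State column (4th field).
# count_states is a hypothetical helper name, not a Condor command.
count_states() {
    awk '$1 ~ /^slot[0-9]+@/ { n[$4]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Sample lines copied from the listing above, standing in for a live pool.
status_sample='slot1@GEOG-B224.gg WINNT51    INTEL  Owner     Idle     0.000  1661  0+01:30:04
slot2@GEOG-B224.gg WINNT51    INTEL  Owner     Idle     0.010  1661  0+01:30:05
slot1@geog-a105.gg WINNT51    INTEL  Claimed   Busy     1.130  1661  0+01:02:57
slot2@geog-a105.gg WINNT51    INTEL  Claimed   Busy     1.130  1661  0+01:02:58
slot1@geog-c200.gg WINNT51    INTEL  Unclaimed Idle     0.000  1662  0+00:00:04
slot2@geog-c200.gg WINNT51    INTEL  Unclaimed Idle     0.040  1662  0+00:00:00'

printf '%s\n' "$status_sample" | count_states
```

On the submission host you would pipe the real listing through the same function: condor_status | count_states.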

Another view of the state-of-play is given by condor_q:

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 51.0   ggdagw          4/22 13:21   0+01:30:07 R  0   3.7  test.bat          
...
...
 51.15  ggdagw          4/22 13:21   0+01:33:46 I  0   3.4  test.bat   
...
...
150 jobs; 44 idle, 106 running, 0 held   

Here we see that:

  • job 51.0 was submitted at 13:21 on the 22nd of April, has been running for just over an hour and a half, and that the job executable is called 'test.bat'
  • We can also see that job 51.15 is idling, rather than running.
  • In total 106 of the 150 jobs submitted to condor are running, and accordingly 44 are still waiting to run and so are idle.
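
If you only want to know which jobs are still waiting, you can pick them out of the listing by the ST column. This is a rough sketch, assuming the layout shown above (job ID in the first column, status in the sixth). The name idle_jobs is hypothetical and the sample lines are copied from the listing.

```shell
# Print the IDs of jobs whose ST column (6th field) is I, i.e. idle.
# idle_jobs is a hypothetical helper name, not a Condor command.
idle_jobs() {
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ && $6 == "I" { print $1 }'
}

# Sample lines copied from the condor_q listing above.
queue_sample=' 51.0   ggdagw          4/22 13:21   0+01:30:07 R  0   3.7  test.bat
 51.15  ggdagw          4/22 13:21   0+01:33:46 I  0   3.4  test.bat'

printf '%s\n' "$queue_sample" | idle_jobs
```

On the submission host the equivalent would be condor_q | idle_jobs.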

Submitting a simple script job

OK, so much for other people's jobs. Let's submit some ourselves. I have prepared some examples to make this as easy as possible. To get these, you can cut & paste the following onto your command line (I assume Linux here, but the same files will work for a Windows submission host too):

svn co http://source.ggy.bris.ac.uk/subversion-open/condor/trunk ./condor
cd condor/examples/example1

Without further ado, let's submit our first job to the pool:

condor_submit win.submit

If you look inside win.submit, you will see that it is a short and reasonably self-explanatory file:

Universe   = vanilla
Notification = never
requirements = OpSys == "WINNT51" && Arch == "INTEL"
Output = test.out
Error = test.err
Log = test.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
Executable = test.bat
Queue

The key lines to note for the moment are:

  • the output of the job will collect in test.out
  • any errors will go to test.err
  • test.log will record the mechanics of sending the job to a remote PC
  • test.bat is the executable that the remote PC will run, so it calls the shots!
  • the Queue keyword causes the job to enter the queue of jobs to be run
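
Since the Executable line decides what actually runs, it can be worth checking that the file it names is present before you submit. The following is a minimal sketch, assuming the simple "keyword = value" layout of win.submit and that you submit from the directory holding the files. The helper name check_executable is made up, and the temporary directory only exists to make the example self-contained.

```shell
# Check that the file named on the Executable line of a submit file
# exists in the current directory. check_executable is a hypothetical
# helper name for this sketch, not part of Condor.
check_executable() {
    submit_file=$1
    # Find the (case-insensitive) Executable keyword and trim its value.
    exe=$(awk -F'=' 'tolower($1) ~ /^[ \t]*executable[ \t]*$/ {
              gsub(/^[ \t]+|[ \t]+$/, "", $2); print $2; exit }' "$submit_file")
    if [ -n "$exe" ] && [ -f "$exe" ]; then
        echo "ok: $exe"
    else
        echo "missing: ${exe:-no Executable line}"
        return 1
    fi
}

# Self-contained demonstration in a scratch directory.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'Executable = test.bat\nQueue\n' > win.submit
: > test.bat                    # empty stand-in for the real batch file
check_executable win.submit
```

In the example directory itself, you could chain the two steps: check_executable win.submit && condor_submit win.submit.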

The executable file test.bat is a short and simple batch file (aka 'shell script' in Linux-speak):

echo Hello from a Condor batch file running on:
hostname
echo The date is:
date /T
echo and the time is:
time /T

The upshot of all the electrickery is the contents of test.out. All being well, this should have been created by now (if not, consult condor_status and condor_q for more information on the state-of-play):

Hello from a Condor batch file running on:
geog-c211
The date is:
22/04/2009 
and the time is:
15:36
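
If test.out has not appeared yet, you can poll for it rather than checking by hand. This is only a sketch: the helper name wait_for_output, the 10-minute default timeout and the 5-second polling interval are all arbitrary choices made for illustration.

```shell
# Poll until a file exists and is non-empty, then print it.
# Gives up with a non-zero status once the timeout has passed.
# wait_for_output is a hypothetical helper, not a Condor command.
wait_for_output() {
    file=$1
    timeout=${2:-600}     # total seconds to wait (arbitrary default)
    interval=${3:-5}      # seconds between checks (arbitrary default)
    waited=0
    while [ ! -s "$file" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "gave up waiting for $file after ${waited}s" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    cat "$file"
}
```

After submitting, you would call it as wait_for_output test.out. If it gives up, condor_q and test.log are the places to look next.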

Congratulations! You've run your first condor job. That's the hardest part out of the way. All we have now are a few more details. So far so good? OK, let's move on to the next example.

Submitting an executable which you have compiled

Running an ensemble of jobs

Condor and energy saving measures