All of your data and code are available on all nodes.
If you want to follow along with any of the examples, you can find them here -
1-basicSlurm.sh
You can run this script on a compute node using sbatch:
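sbatch 1-basicSlurm.sh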
When the job finishes, a summary of the job is written alongside the output in a file named slurm-<job id>.stats:
+--------------------------------------------------------------------------+
| Job on the Baskerville cluster:
| Starting at Tue Jul 25 11:32:17 2023 for allsoppj(836257)
| Identity jobid 474749 jobname 1-basicSlurm.sh
| Running against project edmondac-rsg and in partition baskerville-a100_40
| Requested cpu=2,mem=6G,node=1,billing=2 - 01:00:00 walltime
| Assigned to nodes bask-pg0308u24a
| Command /bask/homes/a/allsoppj/BaskervilleRemoteDropIn/BasicSlurmFile/1-basicSlurm.sh
| WorkDir /bask/homes/a/allsoppj/BaskervilleRemoteDropIn/BasicSlurmFile
+--------------------------------------------------------------------------+
+--------------------------------------------------------------------------+
| Finished at Tue Jul 25 11:32:37 2023 for allsoppj(836257) on the Baskerville Cluster
| Required (00:00.689 cputime, 4232K memory used) - 00:00:20 walltime
| JobState COMPLETING - Reason None
| Exitcode 0:0
+--------------------------------------------------------------------------+
You can check on a queued or running job with squeue -j <job id>. For example:
[allsoppj@bask-pg0310u18a BasicSlurmFile]$ squeue -j 474735
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
474735 baskervil 1-basicSlu allsoppj R 0:09 1 bask-pg0308u24a
The entry persists for a couple of minutes after the job has finished.
Code | State | Meaning
---|---|---
PD | Pending | All good - waiting for resources before starting
R | Running | All good - working away
CG | Completing | All good - finished but some processes still working
CD | Completed | All good - job successfully finished
F | Failed | Something went wrong - the job exited with a non-zero exit code
Jobs can be stopped at any time using scancel:
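scancel <job id>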
Note the lack of a “-j” for this.
The job id can be stored in a bash variable directly at submission time:
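One way is sbatch's --parsable flag, which prints just the job id so it can be captured with command substitution:
JOBID=$(sbatch --parsable 1-basicSlurm.sh)
echo "Submitted job ${JOBID}"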
The job name is used throughout Slurm, so change it to something more readable than the script name:
#SBATCH --job-name "AMoreReadableName"
[allsoppj@bask-pg0310u18a BasicSlurmFile]$ cat slurm-474832.out
This script is running on bask-pg0308u24a.cluster.baskerville.ac.uk
[allsoppj@bask-pg0310u18a BasicSlurmFile]$ cat slurm-474832.stats
+--------------------------------------------------------------------------+
| Job on the Baskerville cluster:
| Starting at Tue Jul 25 17:37:51 2023 for allsoppj(836257)
| Identity jobid 474832 jobname AMoreReadableName
| Running against project edmondac-rsg and in partition baskerville-a100_40
| Requested cpu=2,mem=6G,node=1,billing=2 - 01:00:00 walltime
| Assigned to nodes bask-pg0308u24a
| Command /bask/homes/a/allsoppj/BaskervilleRemoteDropIn/BasicSlurmFile/4-changeName.sh
| WorkDir /bask/homes/a/allsoppj/BaskervilleRemoteDropIn/BasicSlurmFile
+--------------------------------------------------------------------------+
+--------------------------------------------------------------------------+
| Finished at Tue Jul 25 17:38:11 2023 for allsoppj(836257) on the Baskerville Cluster
| Required (00:00.701 cputime, 4236K memory used) - 00:00:19 walltime
| JobState COMPLETING - Reason None
| Exitcode 0:0
+--------------------------------------------------------------------------+
The default value for the --output option, which controls where the job's output is written, is:
slurm-%j.out
where %j expands to the job id.
Full list of options at https://doc.hpc.iter.es/slurm/how_to_slurm_filenamepatterns
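For example, to include the job name as well as the job id in the filename (%x expands to the job name):
#SBATCH --output "%x-%j.out"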
Don't try the following: #SBATCH lines are comments that Slurm parses before the shell runs the script, so command substitutions like $(pwd) are never expanded.
#SBATCH --output $(pwd)/outputfiles/%A_%a.out
Instead, use a template Slurm file and substitute values into a new script per job with the sed bash command.
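A minimal sketch of that pattern (the template and placeholder names here are illustrative). Suppose template.sh contains:
#SBATCH --output @OUTPUT_DIR@/%j.out
Then generate and submit a concrete script per job:
sed "s|@OUTPUT_DIR@|$(pwd)/outputfiles|g" template.sh > job.sh
sbatch job.sh
Because sed runs in your shell, $(pwd) is expanded before the script is submitted.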
Baskerville has two types of GPU: NVIDIA A100s with 40 GB of memory and A100s with 80 GB.
These are documented in more detail at docs.baskerville.ac.uk.
See 6-MoreResources for an example of loading PyTorch and the difference between selecting one GPU and several.
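As a rough sketch, a single GPU can be requested with the --gpus directive; the illustrative script below just reports the allocated devices (the module names to load for PyTorch are Baskerville-specific, so check the documentation for those):
#!/bin/bash
#SBATCH --gpus 1
nvidia-smi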
Adding an --array directive will run the script 10 times with at most 2 jobs running simultaneously.
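For example (0-9 gives ten tasks and %2 caps how many run at once; the zero-based range is an illustrative choice):
#SBATCH --array 0-9%2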
You need to combine this with Slurm's environment variables to make it useful: each task receives its own index in $SLURM_ARRAY_TASK_ID.
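For example, each task can select a different input from its index (the filename pattern here is hypothetical):
CONFIG="config_${SLURM_ARRAY_TASK_ID}.txt"
echo "Array task ${SLURM_ARRAY_TASK_ID} using ${CONFIG}"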
To track these jobs, use the sacct -j <job id> command or the squeue command.
See 2-arrayJobConfig.sh for information on loading a different config in each array job.
We used this approach to run nearly 700,000 jobs in blocks of 4,000, with 500 running at a time.