Bright Cluster Manager is a comprehensive cluster management solution for all types of HPC clusters and server farms, including CPU and GPU clusters, storage and database clusters, and big data Hadoop clusters. Slurm Workload Manager, which is integrated into Bright Cluster Manager, is an open-source resource manager with a plug-in architecture that is used in many large installations and provides both queuing and scheduling functionality. This blog post walks through basic Slurm usage on Linux clusters.

 

Creating a Slurm (memtester) job script

$ cat memtesterScript.sh
#!/bin/bash

/cm/shared/apps/memtester/current/memtester 24G
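
As a sketch, an expanded version of memtesterScript.sh could also carry its resource requests as #SBATCH directives, so they travel with the file instead of being repeated on every sbatch command line. The values below (job name, output pattern, memory, and time limit) are illustrative assumptions rather than anything memtester requires:

$ cat memtesterScript.sh
#!/bin/bash
#SBATCH --job-name=memtester        # name shown in squeue
#SBATCH --output=slurm-%A_%a.out    # %A = array job ID, %a = array task ID
#SBATCH --mem=25G                   # assumed: slight headroom above the 24G being tested
#SBATCH --time=02:00:00             # assumed wall-time limit

# test 24 GB of memory on the allocated node
/cm/shared/apps/memtester/current/memtester 24G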

Submitting the job to N nodes

$ module load slurm
$ sbatch --array=1-50 ~/memtesterScript.sh
Submitted batch job 120
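
The --array flag queues one independent copy of the script per index, and Slurm spreads the tasks over whatever nodes are free. To steer where and how many tasks run at once, sbatch accepts the usual selection flags; the defq partition name appears in the squeue output below, while the --exclusive choice and the %10 throttle are illustrative assumptions:

$ sbatch --array=1-50 --partition=defq --exclusive ~/memtesterScript.sh   # one whole node per task in the defq partition
$ sbatch --array=1-50%10 ~/memtesterScript.sh                             # %10 caps the array at 10 tasks running at once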

Listing the job

$ squeue
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
120_[7-50]      defq memteste      exx PD       0:00      1 (Resources)
     120_1      defq memteste      exx  R       1:42      1 node001
     120_2      defq memteste      exx  R       1:42      1 node002
     120_3      defq memteste      exx  R       1:42      1 node003
     120_4      defq memteste      exx  R       1:42      1 node004
     120_5      defq memteste      exx  R       1:42      1 node005
     120_6      defq memteste      exx  R       1:42      1 node006
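
squeue also accepts filters, which helps once several users or large arrays share the queue. The flags below are standard squeue options, applied to the user and array job from the listing above:

$ squeue -u exx          # only jobs belonging to user exx
$ squeue -j 120          # all tasks of array job 120
$ squeue -t PD           # pending jobs only (state filter)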

 

Getting job details

$ scontrol show job 121
JobId=121 ArrayJobId=120 ArrayTaskId=1 JobName=memtesterScript.sh
   UserId=exx(1002) GroupId=exx(1002)
   Priority=4294901753 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:05:01 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2015-08-14T00:13:20 EligibleTime=2015-08-14T00:13:21
   StartTime=2015-08-14T00:13:21 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=defq AllocNode:Sid=bright71:17752
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node001
   BatchHost=node001
   NumNodes=1 NumCPUs=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/exx/memtesterScript.sh
   WorkDir=/home/exx
   StdErr=/home/exx/slurm-120_1.out
   StdIn=/dev/null
   StdOut=/home/exx/slurm-120_1.out
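
Array task 120_1 was given its own JobId (121), which is why the record above was requested as job 121; scontrol show job also accepts the <array_job>_<task> form directly. Once a task finishes it drops out of squeue and scontrol, so if job accounting is enabled on the cluster (an assumption here, since it depends on the Slurm configuration), sacct can still report on it:

$ scontrol show job 120_3                                      # the same kind of record, addressed as task 3 of array job 120
$ sacct -j 120 --format=JobID,JobName,State,Elapsed,ExitCode   # per-task history; requires accounting to be configured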

 

Suspending a job*

# scontrol suspend 125
# squeue
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
       125      defq memteste      exx  S       0:13      1 node001
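
suspend only acts on a job that is already running. A job that is still pending can be kept from starting with scontrol hold and let go again with scontrol release; the job ID below is just a placeholder for one of the still-pending tasks:

# scontrol hold 130          # placeholder job ID: keep a pending job from being scheduled
# scontrol release 130       # allow it to be considered for scheduling again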

Resuming a job*

# scontrol resume 125
# squeue
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
       125      defq memteste      exx  R       0:13      1 node001
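
Suspend and resume leave the job's processes stopped in memory on the node. If a job should instead be restarted from the beginning, scontrol requeue puts it back into the pending queue; the Requeue=1 field in the scontrol output above indicates the job is eligible for this:

# scontrol requeue 125       # stop the job and return it to the queue to start over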

Killing a job**

$ scancel 125
$ squeue
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
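
scancel also takes filters and understands the array task notation, so individual tasks or whole groups of jobs can be cancelled at once; the examples assume the array job and user from the earlier sections:

$ scancel 120_7                      # cancel a single array task
$ scancel -u exx --state=PENDING     # cancel all of this user's pending jobs
$ scancel -u exx                     # cancel every job belonging to user exx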

 

*Root only.
**Users can kill their own jobs; root can kill any job.