How SLURM Priority Works
When you submit a job to a partition using SLURM, the job is assigned a priority value that is a weighted sum of several factors. According to the official SLURM documentation, it looks something like:
Job_priority =
    site_factor +
    (PriorityWeightAge) * (age_factor) +
    (PriorityWeightAssoc) * (assoc_factor) +
    (PriorityWeightFairshare) * (fair-share_factor) +
    (PriorityWeightJobSize) * (job_size_factor) +
    (PriorityWeightPartition) * (priority_job_factor) +
    (PriorityWeightQOS) * (QOS_factor) +
    SUM(TRES_weight_cpu * TRES_factor_cpu,
        TRES_weight_<type> * TRES_factor_<type>,
        ...)
    - nice_factor
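To make the weighted sum concrete, here is a minimal sketch that evaluates a simplified version of the formula. The weights and factor values are invented for illustration (the real weights are set by each site's administrators), the function name job_priority is hypothetical, and the TRES terms are left out for brevity.

```shell
# Sketch: evaluate the weighted sum for hypothetical weights and factors.
# Usage: job_priority SITE_FACTOR NICE_FACTOR WEIGHT FACTOR [WEIGHT FACTOR ...]
job_priority() (
  site=$1
  nice=$2
  shift 2
  # printf reuses its format, so this emits one "weight factor" pair per line;
  # awk multiplies each pair and accumulates the sum.
  printf '%s %s\n' "$@" | awk -v site="$site" -v nice="$nice" '
    { total += $1 * $2 }
    END { printf "%d\n", site + total - nice }'
)

# Hypothetical values: age 1000 * 0.25, fairshare 10000 * 0.5, job size 800 * 0.125
job_priority 0 0 1000 0.25 10000 0.5 800 0.125   # prints 5350
```

With these made-up numbers the fairshare term contributes 5000 of the 5350 points, which illustrates how a single heavily weighted factor can dominate the total.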
Within a queue, the job with the highest priority will be the next to start. Three main factors affect your job's priority:
- Age - Your job will build priority points based on its age in the current queue.
- Size - Your job will move through the queue faster if you request only the resources you need. Over-requesting resources will reduce your job's priority.
- Fairshare - As you run jobs, your fairshare decreases. It takes 28 days to fully reset your fairshare to the maximum value. In this way, you can view fairshare as a priority penalty based on a rolling window of your past 4 weeks of usage. Avoid over-requesting resources to preserve your fairshare.
You can check the priority factors for jobs using the sprio command. Below is the priority of all jobs in the gpu partition. In this example, the single biggest factor is the fairshare value.

How to Check the Queue
The GPU partition is often in high demand. Here's how to check what's going on so you can make an informed decision when submitting jobs:
To check the status of the gpu partition, run:
squeue -p gpu -O "JobID,State,TimeUsed,TimeLimit,NumNodes,ReasonList,Gres,Priority"

The priority value here is normalized to lie between 0 and 1, but it is otherwise consistent with the sprio output shown earlier. The "TRES_PER_NODE" column shows which GPU each job is requesting. Many jobs are requesting H100 GPUs, and the "TIME_LIMIT" column shows that most of them ask for 2 days of runtime. Another job requesting an H100 will therefore face significant competition in the queue.
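The same tally can be made without reading the table by eye. The sketch below counts waiting jobs per requested GPU string. The function name count_requests is hypothetical, and the sample lines stand in for real squeue output, whose exact gres strings vary between SLURM versions and sites.

```shell
# Sketch: count identical request strings on stdin, most common first.
count_requests() {
  sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# On a cluster: -h drops the header, -t PENDING keeps only waiting jobs,
# and %b prints the generic resources (gres) each job requests.
#   squeue -p gpu -h -t PENDING -o "%b" | count_requests

# Hypothetical sample input so the sketch runs anywhere:
printf '%s\n' gpu:h100:1 gpu:v100:1 gpu:h100:1 gpu:h100:1 gpu:v100:1 | count_requests
```

A long count next to one GPU type is a hint that a job requesting that type will wait longer.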
We can also get more information on the state of the nodes in the gpu partition, including the ratio of used to available GPUs, using the command:
sinfo -O "Nodelist,Gres,GresUsed" -p gpu
This will output:

We can see that each V100 box has 4 GPUs and the H100 box has 8 GPUs. Of those, 8 H100 GPUs and 7 V100 GPUs are in use. This gives 100% of the H100s and 53% of the V100s in use.
This explains the queue seen in the previous command; there are multiple H100 jobs in the queue waiting for an H100 to become available.
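The utilization percentages can also be computed directly from the sinfo output rather than by hand. This is a minimal sketch: the function name gpu_usage is hypothetical, the sample rows are invented, and it assumes each node reports a single gres entry shaped like gpu:<type>:<count>.

```shell
# Sketch: summarize GPU usage per GPU type from sinfo-style input on stdin.
# Assumes one gres entry per node shaped like gpu:<type>:<count>, optionally
# followed by a parenthesized suffix such as (IDX:0-1) or (S:0-1).
gpu_usage() {
  awk 'NR > 1 {
    avail = $2; used = $3
    sub(/\(.*/, "", avail); sub(/\(.*/, "", used)   # drop the "(...)" suffix
    na = split(avail, a, ":"); nu = split(used, u, ":")
    type = (na >= 3) ? a[2] : "gpu"
    total[type] += a[na]; busy[type] += u[nu]
  }
  END {
    for (t in total)
      if (total[t] > 0)
        printf "%s %d/%d %.0f%%\n", t, busy[t], total[t], 100 * busy[t] / total[t]
  }' | sort
}

# On a cluster (explicit widths keep long gres strings from running together):
#   sinfo -p gpu -O "Nodelist,Gres:35,GresUsed:45" | gpu_usage

# Hypothetical sample input so the sketch runs anywhere:
gpu_usage <<'EOF'
NODELIST  GRES        GRES_USED
node01    gpu:v100:4  gpu:v100:2(IDX:0-1)
node02    gpu:h100:8  gpu:h100:8(IDX:0-7)
EOF
```

For the sample rows this prints one line per GPU type, here 8/8 (100%) for the H100s and 2/4 (50%) for the V100s.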
But There's an Available GPU!
We can see that there are available V100 GPUs, as well as a job in the queue that wants a V100. Why is that job not allowed to start?
The answer lies in the other resources: CPUs and memory. SLURM can only start a job if sufficient resources of every type the job requests are available.
Memory is a common culprit. Users often request far more memory than their job needs (for fear of running out), so the node runs out of memory before any other resource.
Let's check the state of the gpu queue with respect to the CPUs and Memory available. We can modify the sinfo command used before to include them:
sinfo -O "Nodelist,Gres:35,GresUsed:45,CPUsState:15,Memory:10,AllocMem" -p gpu
This outputs:

The CPUS field stands for Allocated/Idle/Other/Total. The MEMORY field shows the total memory on the node, while ALLOCMEM shows the memory that is allocated to running jobs.
If we take the bottom V100 node as an example, we can see that while only 2 of the 4 V100 GPUs are in use, 40/40 CPU cores are in use. This means that no job can start to use the 2 idle V100s because there are no CPUs to service those jobs.
This is why it is important to request only the resources that you intend to use. Over-requesting one resource can lock other users out of another resource, because the node can no longer satisfy every part of a new job's request.
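Nodes in this situation can be picked out automatically. The sketch below flags nodes that still have idle GPUs but have no idle CPUs or no unallocated memory. The function name stranded_gpus is hypothetical, the sample rows are invented, and the parsing assumes the six columns produced by the sinfo command above with one gres entry per node.

```shell
# Sketch: list nodes that have idle GPUs but no idle CPUs or no free memory.
# Expects six columns on stdin: node, gres, gres used, CPUs (A/I/O/T),
# total memory, allocated memory.
stranded_gpus() {
  awk 'NR > 1 {
    avail = $2; used = $3
    sub(/\(.*/, "", avail); sub(/\(.*/, "", used)   # drop the "(...)" suffix
    na = split(avail, a, ":"); nu = split(used, u, ":")
    idle_gpus = a[na] - u[nu]
    split($4, cpu, "/")          # allocated/idle/other/total
    free_mem = $5 - $6           # total memory minus allocated memory
    if (idle_gpus > 0 && (cpu[2] + 0 == 0 || free_mem <= 0))
      printf "%s idle_gpus=%d idle_cpus=%d free_mem=%d\n", $1, idle_gpus, cpu[2], free_mem
  }'
}

# On a cluster:
#   sinfo -p gpu -O "Nodelist,Gres:35,GresUsed:45,CPUsState:15,Memory:10,AllocMem" | stranded_gpus

# Hypothetical sample input so the sketch runs anywhere:
stranded_gpus <<'EOF'
NODELIST  GRES        GRES_USED            CPUS(A/I/O/T)  MEMORY  ALLOCMEM
node01    gpu:v100:4  gpu:v100:2(IDX:0-1)  40/0/0/40      384000  96000
node02    gpu:v100:4  gpu:v100:3(IDX:0-2)  12/28/0/40     384000  64000
EOF
```

In the sample, only node01 is reported: it has 2 idle GPUs but no idle CPUs, so no new job can be placed on it.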
When Will My Job Start?
Depending on your job's position in the queue, SLURM may have an estimated start time for it. Because the scheduling problem becomes more complex with every additional job considered, SLURM only predicts start times for the next few jobs in the queue. The remaining jobs are given a prediction of "N/A".
You can check the estimated start time of a job using scontrol:
scontrol show job ###
Here ### is your job ID. The output shows information about your job, including the estimated start time (the StartTime field) if one has been calculated.
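If only the start time is of interest, it can be filtered out of the scontrol output. This is a minimal sketch: the function name job_start_time and the job ID 12345 are placeholders, and the sample text is an invented excerpt of scontrol output.

```shell
# Sketch: print the StartTime value from "scontrol show job" output on stdin.
job_start_time() {
  # Put each Key=Value pair on its own line, then keep the StartTime value.
  tr -s ' ' '\n' | sed -n 's/^StartTime=//p' | head -n 1
}

# On a cluster (12345 is a placeholder job ID):
#   scontrol show job 12345 | job_start_time
# squeue can also list estimated start times for pending jobs:
#   squeue --start -j 12345

# Hypothetical excerpt of scontrol output so the sketch runs anywhere:
job_start_time <<'EOF'
   SubmitTime=2024-05-01T09:00:00 EligibleTime=2024-05-01T09:00:00
   StartTime=2024-05-02T14:30:00 EndTime=2024-05-04T14:30:00 Deadline=N/A
EOF
```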
