Submitting to specific nodes:
sbatch --exclude node[001-008] submit.GPU.thor.sh
sbatch --nodelist node010 submit.GPU.thor.sh
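If you always want a job steered this way, the same options can live inside the submit script as #SBATCH directives instead of on the command line. A rough sketch (the node names are just the ones from the examples above; the rest of submit.GPU.thor.sh is whatever you already have):

#!/bin/bash
#SBATCH --exclude=node[001-008]    # never land on these nodes
## or, to pin the job to one specific node instead:
##SBATCH --nodelist=node010
# ... the rest of your usual submit script ...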
Holding jobs in the queue:
This lets you keep jobs in the queue that won't run even if resources are available, usually so that other people's jobs can go ahead of yours.
scontrol hold $jobID (get the job ID from squeue; it's the leftmost column), and then, when you want it to run:
scontrol release $jobID
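For example (the job ID 12345 is made up; use whatever squeue shows for your job):

squeue -u $USER          # list your jobs; the job ID is the leftmost column
scontrol hold 12345      # job stays pending even when nodes free up
scontrol release 12345   # let the scheduler run it again

You can also submit a job already held with sbatch --hold submit.GPU.thor.sh and release it later the same way.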
Checking status of nodes:
sinfo ("mix" means the node is running jobs, "idle" means it is free and waiting for work; both "down" and "drain" are bad):
[oliver@thoreau 0.4.0_glycoproteinLys.pdb]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      5  down* node[003-007]
defq*        up   infinite      1  drain node001
defq*        up   infinite      3    mix node[002,009-010]
mdaas        up   infinite      2   idle node[008,011]
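If you need more detail than the summary, sinfo can print one line per node, and scontrol will show everything Slurm knows about a node, including why it is drained or down (node001 here is just an example):

sinfo -N -l                  # one line per node, with state and reason
scontrol show node node001   # full detail for a single node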
Checking GPU status:
ssh node001 nvidia-smi

Example output (this one is from node009):

[oliver@node009 ~]$ nvidia-smi
Mon May 13 01:06:36 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     On   | 00000000:3B:00.0 Off |                  Off |
| 38%   64C    P2   187W / 230W |    754MiB / 16125MiB |     97%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000     On   | 00000000:5E:00.0 Off |                  Off |
| 33%   38C    P2    64W / 230W |    466MiB / 16125MiB |     24%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 5000     On   | 00000000:AF:00.0 Off |                  Off |
| 33%   23C    P8     7W / 230W |      0MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 5000     On   | 00000000:D8:00.0 Off |                  Off |
| 33%   23C    P8     7W / 230W |      0MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     289931      C   ...ps/amber20/bin/pmemd.cuda     751MiB |
|    1   N/A  N/A     293344      C   ...ps/amber20/bin/pmemd.cuda     463MiB |
+-----------------------------------------------------------------------------+

GPU-Util tells you how hard each GPU is working. The processes should all be on separate GPUs, i.e. GPUs 0 and 1, not 0 and 0.
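To spot-check several nodes at once, you can loop over them and ask nvidia-smi for just the numbers. A sketch (the node names are the "mix" nodes from the sinfo example above; swap in whichever you care about):

for n in node002 node009 node010; do
    echo "== $n =="
    ssh "$n" nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader
done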