Sometimes we run into ghost processes: processes that occupy GPU resources but are no longer doing any useful work.
For example:
Fri May 30 11:22:31 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla P100-SXM2-16GB Off | 00000000:1A:00.0 Off | 0 |
| N/A 38C P0 43W / 300W | 8696MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2-16GB Off | 00000000:1B:00.0 Off | 0 |
| N/A 41C P0 42W / 300W | 9408MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Tesla P100-SXM2-16GB Off | 00000000:3D:00.0 Off | 0 |
| N/A 39C P0 44W / 300W | 4288MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 Tesla P100-SXM2-16GB Off | 00000000:3E:00.0 Off | 0 |
| N/A 39C P0 43W / 300W | 404MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 Tesla P100-SXM2-16GB Off | 00000000:88:00.0 Off | 0 |
| N/A 42C P0 43W / 300W | 13248MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 Tesla P100-SXM2-16GB Off | 00000000:89:00.0 Off | 0 |
| N/A 45C P0 49W / 300W | 9450MiB / 16384MiB | 8% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 Tesla P100-SXM2-16GB Off | 00000000:B1:00.0 Off | 0 |
| N/A 42C P0 40W / 300W | 9200MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 Tesla P100-SXM2-16GB Off | 00000000:B2:00.0 Off | 0 |
| N/A 43C P0 41W / 300W | 5350MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 1 N/A N/A 65559 C python 9406MiB |
| 3 N/A N/A 85647 C ...Polyspace/R2020a/bin/glnxa64/MATLAB 400MiB |
| 5 N/A N/A 20759 C rddfg_cent_rw-StarCraft2-debug@syc 9448MiB |
+---------------------------------------------------------------------------------------+
Only three processes are explicitly listed as holding GPU memory, yet memory on the other GPUs is occupied as well.
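These ghosts hold the NVIDIA device files open without appearing in nvidia-smi's process table, so they have to be found at the file-descriptor level. Besides `fuser`, one self-contained way is to scan `/proc` directly. A minimal sketch (the helper name `pids_using` is our own; run as root to see other users' processes):

```shell
#!/bin/bash
# pids_using PATTERN — print every PID that has a file matching PATTERN open,
# by walking the /proc/<pid>/fd symlinks (no fuser/lsof dependency).
pids_using() {
    local pattern=$1
    for fd in /proc/[0-9]*/fd/*; do
        # readlink fails for processes we lack permission on, or that just exited
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            $pattern) echo "$fd" | cut -d/ -f3 ;;
        esac
    done | sort -u
}

# Every process holding an NVIDIA device file, visible in nvidia-smi or not
pids_using '/dev/nvidia*'
```

Without root this only reliably reports your own processes, which is often enough to spot your own leftover workers.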
Solution

Save the following script and run it:
#!/bin/bash
# Legitimate PIDs shown by nvidia-smi
# (the PIDs below are the ones from the example output above)
KEEP_PIDS=("65559" "85647" "20759")

# All processes holding the GPU device files, including ones nvidia-smi does not show
ALL_PIDS=$(fuser /dev/nvidia* 2>/dev/null | tr ' ' '\n' | sort -u)

for pid in $ALL_PIDS; do
    if [[ ! " ${KEEP_PIDS[*]} " =~ " ${pid} " ]]; then
        echo "Killing ghost process PID $pid"
        sudo kill -9 "$pid"
    else
        echo "Keeping important process PID $pid"
    fi
done
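The whitelist check in the script relies on a common bash idiom: join the array with spaces, then do a substring match with spaces padded on both sides, so a PID only matches as a whole word (e.g. 559 does not match inside 65559). In isolation:

```shell
#!/bin/bash
# Whole-word membership test for a bash array (same idiom as the script above)
KEEP_PIDS=("65559" "85647" "20759")

is_kept() {
    # Padding both sides with spaces prevents "559" matching inside "65559";
    # a quoted right-hand side of =~ is a literal substring match, not a regex.
    [[ " ${KEEP_PIDS[*]} " =~ " $1 " ]]
}

for pid in 65559 559 12345; do
    if is_kept "$pid"; then echo "keep $pid"; else echo "kill $pid"; fi
done
# Prints:
#   keep 65559
#   kill 559
#   kill 12345
```

Before trusting the script on a shared machine, it is worth replacing `sudo kill -9 "$pid"` with an `echo` first and reviewing the list it would kill.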