Kill Ghost Process

Killing ghost processes that occupy GPU memory

Sometimes we run into ghost processes: processes that hold GPU resources without actually doing any work.

For example:

Fri May 30 11:22:31 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-SXM2-16GB           Off | 00000000:1A:00.0 Off |                    0 |
| N/A   38C    P0              43W / 300W |   8696MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2-16GB           Off | 00000000:1B:00.0 Off |                    0 |
| N/A   41C    P0              42W / 300W |   9408MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2-16GB           Off | 00000000:3D:00.0 Off |                    0 |
| N/A   39C    P0              44W / 300W |   4288MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2-16GB           Off | 00000000:3E:00.0 Off |                    0 |
| N/A   39C    P0              43W / 300W |    404MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla P100-SXM2-16GB           Off | 00000000:88:00.0 Off |                    0 |
| N/A   42C    P0              43W / 300W |  13248MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla P100-SXM2-16GB           Off | 00000000:89:00.0 Off |                    0 |
| N/A   45C    P0              49W / 300W |   9450MiB / 16384MiB |      8%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla P100-SXM2-16GB           Off | 00000000:B1:00.0 Off |                    0 |
| N/A   42C    P0              40W / 300W |   9200MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla P100-SXM2-16GB           Off | 00000000:B2:00.0 Off |                    0 |
| N/A   43C    P0              41W / 300W |   5350MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    1   N/A  N/A     65559      C   python                                     9406MiB |
|    3   N/A  N/A     85647      C   ...Polyspace/R2020a/bin/glnxa64/MATLAB      400MiB |
|    5   N/A  N/A     20759      C   rddfg_cent_rw-StarCraft2-debug@syc         9448MiB |
+---------------------------------------------------------------------------------------+

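Only three processes are explicitly listed as using GPU memory (on GPUs 1, 3, and 5), yet GPUs 0, 2, 4, 6, and 7 also show several GiB in use with no owning process listed.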
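Before killing anything, it can help to see which PIDs are actually unaccounted for. The following is a minimal sketch, not a definitive tool: the helper name `ghost_candidates` is hypothetical, and it assumes that `fuser` (from psmisc) and `nvidia-smi` are installed and that the script runs with enough privileges to see other users' processes. It prints every PID that holds a `/dev/nvidia*` device file open but does not appear in the compute-process list of `nvidia-smi`.

```bash
#!/bin/bash

# Hypothetical helper: print each PID from the first list that is absent
# from the second list. Both arguments are whitespace-separated PID strings.
ghost_candidates() {
  local holders listed pid
  holders=$(echo $1)       # unquoted on purpose: collapses spaces and newlines
  listed=" $(echo $2) "    # pad with spaces so whole PIDs can be matched
  for pid in $holders; do
    case "$listed" in
      *" $pid "*) ;;       # listed by nvidia-smi, not a candidate
      *) echo "$pid" ;;
    esac
  done | sort -un          # numeric sort, duplicates removed
}

# Only query the system when both tools are available.
if command -v nvidia-smi >/dev/null 2>&1 && command -v fuser >/dev/null 2>&1; then
  # PIDs of compute processes that nvidia-smi reports.
  listed=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)
  # PIDs of all processes holding a GPU device file open.
  holders=$(fuser /dev/nvidia* 2>/dev/null)
  for pid in $(ghost_candidates "$holders" "$listed"); do
    # Show owner, elapsed time, and command line for manual review.
    ps -o pid=,user=,etime=,args= -p "$pid"
  done
fi
```

Not every candidate is necessarily a ghost: processes such as a display server or a driver daemon may also hold these device files open without showing up as compute processes, so the output is meant for review rather than for piping straight into `kill`.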

Solution

Save and run the following script:

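#!/bin/bash

# PIDs that nvidia-smi lists as legitimate and that must be kept.
# The values below are examples; replace them with your own.
KEEP_PIDS=("65559" "85647" "20759")

# Every process holding a GPU device file open,
# including those that nvidia-smi does not list.
ALL_PIDS=$(fuser /dev/nvidia* 2>/dev/null | tr -s ' ' '\n' | sort -u)

for pid in $ALL_PIDS; do
  # Pad both sides with spaces so that only whole PIDs match.
  if [[ " ${KEEP_PIDS[*]} " != *" ${pid} "* ]]; then
    echo "Killing ghost process PID $pid"
    sudo kill -9 "$pid"
  else
    echo "Keeping important process PID $pid"
  fi
done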
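The script above sends `SIGKILL` right away, which gives a process no chance to clean up. As a gentler alternative, here is a minimal sketch that tries `SIGTERM` first and escalates only if the process is still alive after a grace period. The function name `terminate_pid` and the default grace period of 5 seconds are assumptions made for illustration, and the function needs to run as root to act on other users' processes.

```bash
#!/bin/bash

# Hypothetical helper: ask a process to exit, then force it if needed.
# Usage: terminate_pid PID [GRACE_SECONDS]
terminate_pid() {
  local pid="$1" grace="${2:-5}" i=0
  # Polite request first; if the signal cannot be sent, there is nothing to do.
  kill -TERM "$pid" 2>/dev/null || return 0
  # Poll once per second until the process is gone or the grace period ends.
  while [ "$i" -lt "$grace" ]; do
    kill -0 "$pid" 2>/dev/null || return 0
    sleep 1
    i=$((i + 1))
  done
  # Still alive after the grace period: force termination.
  kill -KILL "$pid" 2>/dev/null
}

# Example (commented out): replace the kill line in the loop with
#   terminate_pid "$pid" 5
```

A process stuck in uninterruptible sleep inside the driver may not respond even to `SIGKILL`; in that case the memory stays allocated until the process leaves that state.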
Licensed under CC BY-NC-SA 4.0