
Suanpan GPU Configuration [2021]

>>Driver download links:

Official: https://www.nvidia.com/en-us/drivers/unix/
Recommended: https://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/XFree86/Linux-x86_64/460.80/NVIDIA-Linux-x86_64-460.80.run&lang=us&type=TITAN

>>Add the GPU label to the K8S node (replace xxx with the node name)

kubectl label nodes xxx suanpan.xuelangyun.com/gpu=available
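
To confirm the label was applied, the node's labels can be inspected. This is a minimal sketch, assuming kubectl is already configured for the cluster. The helper name has_gpu_label is hypothetical; it reads the output of kubectl get node --show-labels on stdin, so it can be tried without a cluster.

```shell
# Hypothetical helper: succeeds if the GPU label appears in the text on stdin.
has_gpu_label() {
    grep -qE '(^|[[:space:],])suanpan\.xuelangyun\.com/gpu=available(,|[[:space:]]|$)'
}

# Usage on a real cluster (replace xxx with the node name):
#   kubectl get node xxx --show-labels | has_gpu_label && echo "GPU label present"
# Or list every labelled node directly:
#   kubectl get nodes -l suanpan.xuelangyun.com/gpu=available
```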

I. Install dependencies and the driver:

a) If gdm is running, stop it first

service gdm stop 2>/dev/null

b) Install gcc, dkms, and the kernel-devel package that matches the running kernel version

yum -y install gcc dkms kernel-devel "kernel-devel-uname-r == $(uname -r)"
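
Whether the installed kernel-devel really matches the running kernel can be checked before starting the driver installer. The sketch below assumes the default rpm -q output format (name-version-release.arch); devel_matches_kernel is a hypothetical helper name.

```shell
# Hypothetical helper: reads `rpm -q kernel-devel` output on stdin and
# succeeds if one line is exactly kernel-devel-<release>.
# $1 = kernel release, normally "$(uname -r)".
devel_matches_kernel() {
    grep -qxF "kernel-devel-$1"
}

# Usage on the GPU host:
#   rpm -q kernel-devel | devel_matches_kernel "$(uname -r)" \
#       || echo "kernel-devel does not match $(uname -r)"
```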

c) Install the NVIDIA GPU driver

sudo bash NVIDIA-Linux-x86_64-460.80.run

The installation log is written to /var/log/nvidia-installer.log. If the installer presents the selection prompt pictured in the original page, choose NO.
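
If the installer exits early, the log can be scanned for failures. This is a rough sketch, assuming failures are logged with the word ERROR, as in the messages quoted in the problem list of this document; installer_errors is a hypothetical helper name.

```shell
# Hypothetical helper: print ERROR lines (with line numbers) from an installer log.
# $1 = log file (defaults to /var/log/nvidia-installer.log).
# Exit status: 0 = no ERROR lines, 1 = ERROR lines found, 2 = log unreadable.
installer_errors() {
    log="${1:-/var/log/nvidia-installer.log}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 2
    fi
    if grep -n 'ERROR' "$log"; then
        return 1
    fi
    return 0
}

# Usage:
#   installer_errors || echo "see the lines above"
```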

d) Install nvidia-container-runtime

curl -s -L https://nvidia.github.io/nvidia-container-runtime/centos7/nvidia-container-runtime.repo | sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo 
sudo yum install nvidia-container-runtime
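
After the package is installed, it may be worth checking that the runtime binary is where the Docker configuration in section II expects it. A small sketch; runtime_installed is a hypothetical helper name, and the default path is the one used in daemon.json in this document.

```shell
# Hypothetical helper: succeed if the runtime binary exists and is executable.
# $1 = path to check (defaults to /usr/bin/nvidia-container-runtime).
runtime_installed() {
    [ -x "${1:-/usr/bin/nvidia-container-runtime}" ]
}

# Usage:
#   runtime_installed || echo "nvidia-container-runtime not found"
```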

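>>Common driver installation problems

Problem 1) ERROR: Unable to find the kernel source tree for the currently running kernel……
Cause: the kernel development package (kernel-devel) is not installed. If it is installed and the error still appears, try rebooting the server a few times.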

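#1) Check the running kernel version
uname -r

#2) List the available kernel packages
yum list | grep kernel

#3) Update packages from the yum repositories (reboot afterwards if the kernel itself was upgraded, so that uname -r reports the new version)
yum -y update

#4) Install the kernel-devel package matching the running kernel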
yum -y install kernel-devel "kernel-devel-uname-r == $(uname -r)"

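Problem 2) …… Error! echo Your kernel headers for kernel 3.10.0-957.el7.x86_64 cannot be found at /lib/modules/3.10.0-957.el7.x86_64/build or /lib/modules/3.10.0-957.el7.x86_64/source……
Cause: the installed kernel-devel version does not match the running kernel (similar to Problem 1).
If no matching version is available, switch to another mirror and search again, or find the matching package online and install it manually.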
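
The two paths named in the error message can be checked directly. This sketch tests whether either exists for a given kernel release; kernel_headers_present is a hypothetical helper name, and the optional second argument (a root prefix) exists only so the check can be tried outside a real system.

```shell
# Hypothetical helper: succeed if the kernel build or source tree exists.
# $1 = kernel release, $2 = optional root prefix (only useful for testing).
# -e follows symlinks, so a dangling /lib/modules/<release>/build link counts as missing.
kernel_headers_present() {
    [ -e "${2:-}/lib/modules/$1/build" ] || [ -e "${2:-}/lib/modules/$1/source" ]
}

# Usage:
#   kernel_headers_present "$(uname -r)" || echo "headers missing for $(uname -r)"
```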

Problem 3) ERROR: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding……
Cause: Nouveau conflicts with the NVIDIA driver and must be disabled:

#1) Add the nouveau driver to the blacklist:
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf

#2) Rebuild the initramfs image file with dracut:
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut -v /boot/initramfs-$(uname -r).img $(uname -r)

#3) Reboot the server, then confirm the nouveau driver is no longer loaded (the command should print nothing):
lsmod | grep nouveau
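
Running the echo in step #1 more than once appends duplicate lines. The sketch below shows one way to make that step repeatable and to test the lsmod result in a script. Both helper names are hypothetical.

```shell
# Hypothetical helper: append "blacklist nouveau" only if it is not already present.
# $1 = config file (defaults to /etc/modprobe.d/blacklist.conf).
blacklist_nouveau() {
    conf="${1:-/etc/modprobe.d/blacklist.conf}"
    touch "$conf" || return 1
    grep -qxF 'blacklist nouveau' "$conf" || echo 'blacklist nouveau' >> "$conf"
}

# Hypothetical helper: reads `lsmod` output on stdin, succeeds if nouveau is loaded.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Usage after the reboot:
#   lsmod | nouveau_loaded && echo "nouveau is still loaded"
```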

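Problem 4) No package dkms available.
Cause: the configured mirror does not provide the package (on CentOS 7, dkms is normally provided by the EPEL repository). Update or switch the mirror (see "CentOS: changing the mirror source"), or try the offline bundle below (a generic build that may not match your system exactly):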
https://suanpan-public.oss-cn-shanghai.aliyuncs.com/suanpan-installer/gpu-installer.zip

yum -y -e 0 localinstall *.rpm

II. Configure the Docker runtime

a) Add the Docker configuration (create daemon.json if it does not exist)

vim /etc/docker/daemon.json 

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
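
For scripted setups, the same file can be written without an editor. This is a sketch under the assumption that no daemon.json exists yet: an existing file is left untouched and its settings must be merged by hand. write_daemon_json is a hypothetical helper name.

```shell
# Hypothetical helper: create daemon.json with the nvidia runtime settings,
# but only when the file does not exist yet.
# $1 = target path (defaults to /etc/docker/daemon.json).
write_daemon_json() {
    target="${1:-/etc/docker/daemon.json}"
    if [ -e "$target" ]; then
        echo "$target already exists; merge the settings by hand" >&2
        return 1
    fi
    cat > "$target" <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
}

# Usage (as root):
#   write_daemon_json
```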

b) Reload the configuration and restart Docker

systemctl daemon-reload && systemctl restart docker

c) Run docker info and check that Default Runtime is now nvidia
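
The check can also be scripted. The sketch below extracts the value from the text form of docker info, assuming the line is labelled "Default Runtime:" as described above; default_runtime is a hypothetical helper name.

```shell
# Hypothetical helper: print the default runtime named in `docker info` text on stdin.
default_runtime() {
    sed -n 's/^[[:space:]]*Default Runtime:[[:space:]]*//p'
}

# Usage:
#   [ "$(docker info 2>/dev/null | default_runtime)" = "nvidia" ] \
#       && echo "nvidia is the default runtime"
```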


III. Verify that containers can access the GPU

:~$ docker run -it --rm --gpus all ubuntu nvidia-smi 

Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
f476d66f5408: Pull complete
8882c27f669e: Pull complete
d9af21273955: Pull complete
f5029279ec12: Pull complete
Digest: sha256:d26d529daa4d8567167181d9d569f2a85da3c5ecaf539cace2c6223355d69981
Status: Downloaded newer image for ubuntu:latest
Tue May 7 15:52:15 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.116                Driver Version: 390.116                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    22W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
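
To use this check in a script rather than reading the table by eye, the driver version can be pulled from the banner line. A minimal sketch, assuming the banner contains "Driver Version: <number>" as in the output above; driver_version is a hypothetical helper name.

```shell
# Hypothetical helper: print the driver version found in nvidia-smi output on stdin.
driver_version() {
    sed -n 's/.*Driver Version:[[:space:]]*\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Usage: an empty result means the container could not reach the GPU driver.
#   docker run --rm --gpus all ubuntu nvidia-smi | driver_version
```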