gpu云主机使用Docker部署深度学习环境(中)

 2023-12-24  阅读 2  评论 0

摘要:GPU云服务器使用Docker部署深度学习环境(下) 本文讲新睿云GPU加速云服务器,利用Deepo搭建深度学习环境! 一、安装nvidia-docker: 本文是系列文章 上篇请查看《gpu云服务器使用Docker部署深度学习环境(上)》 中篇请查看《gpu云服务器使用Docker部署深度学习环境(中)》

gpu云主机使用Docker部署深度学习环境(中)

GPU云服务器使用Docker部署深度学习环境(下)

本文讲新睿云GPU加速云服务器,利用Deepo搭建深度学习环境!

、安装nvidia-docker:

本文是系列文章

上篇请查看《gpu云服务器使用Docker部署深度学习环境(上)》

中篇请查看《gpu云服务器使用Docker部署深度学习环境(中)》

下篇请查看《gpu云服务器使用Docker部署深度学习环境(下)》

单独安装Docker之后还无法使用带GPU的深度学习机器,需要再安装一下英伟达出品的Nvidia-docker。

1.安装:

# Add the package repositories

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add –

$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

$ sudo systemctl restart docker

2.测试:

现在可以测试了,以下是在一台4卡1080TI机器上的测试结果,宿服务器CUDA版本为9.2:

docker run –gpus all nvidia/cuda:9.0-base nvidia-smi

第一次运行的时候结果大致如下,需要从官方镜像拉取:

 

测试

完整用法可以参考nvidia-docker官方github:

#### Test nvidia-smi with the latest official CUDA image

$ docker run –gpus all nvidia/cuda:9.0-base nvidia-smi

 

# Start a GPU enabled container on two GPUs

$ docker run –gpus 2 nvidia/cuda:9.0-base nvidia-smi

 

# Starting a GPU enabled container on specific GPUs

$ docker run –gpus ‘”device=1,2″‘ nvidia/cuda:9.0-base nvidia-smi

$ docker run –gpus ‘”device=UUID-ABCDEF,1′” nvidia/cuda:9.0-base nvidia-smi

 

# Specifying a capability (graphics, compute, …) for my container

# Note this is rarely if ever used this way

$ docker run –gpus all,capabilities=utility nvidia/cuda:9.0-base nvidia-smi

二、测试Tensorflow Docker镜像

几乎所有的深度学习框架都提供了Docker镜像,这里以Tensorflow Docker镜像为例,来玩一下 Tensorflow,Tensorflow Docker的版本是通过Docker Tag区分的:

 

Tensorflow Docker

1.默认拉取的Tensorflow镜像是CPU版本:

docker pull tensorflow/tensorflow

运行bash:

docker run -it tensorflow/tensorflow bash

 

Tensorflow

现在可以在Tensorflow Docker下执行python,可以看得出这是CPU版本的Tensorflow:

root@ed70090804e5:/# python

Python 2.7.15+ (default, Nov 27 2018, 23:36:35) [GCC 7.3.0] on linux2

Type “help”, “copyright”, “credits” or “license” for more information.>>> import tensorflow as tf>>> tf.__version__’1.14.0′>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))-08-15 09:32:38.522061: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA-08-15 09:32:38.549054: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3598085000 Hz-08-15 09:32:38.550255: I tensorflow/compiler/xla/service/service.cc:166.6] XLA service 0x556fa8e5dd60 executing computations on platform Host. Devices:-08-15 09:32:38.550282: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>

Device mapping:

/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device-08-15 09:32:38.551424: I tensorflow/core/common_runtime/direct_session.cc:296] Device mapping:

/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device

2.获取GPU版本的Tensorflow-devel镜像:

docker pull tensorflow/tensorflow:devel-gpu

运行:

docker run –gpus all -it tensorflow/tensorflow:devel-gpu

这里有报错:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused “process_linux.go:430: container init caused /”process_linux.go:413: running prestart hook 0 caused ///”error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli –load-kmods configure –ldconfig=@/sbin/ldconfig.real –device=all –compute –utility –require=cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 –pid=9821 /home/dlfour/dockerdata/docker/overlay2/814a65fbfc919d68014a6af2c3d532216a5a622e5540a77967719fd0c688a0cb/merged]////nnvidia-container-cli: requirement error: unsatisfied condition: brand = tesla////n///”/””: unknown.

仔细看了一下,最新版的Tensorflow GPU Docker 容器需要的是CUDA>=10.0,这台机器是9.2,并不符合,两种解决方案,一种是升级CUDA到10.x版本,但是我暂时不想升级,google了一下,发现这个tag版本可用cuda9.x:1.12.0-gpu ,所以重新拉取Tensorflow相应版本的镜像:

docker pull tensorflow/tensorflow:1.12.0-gpu

运行:

docker run –gpus all -it tensorflow/tensorflow:1.12.0-gpu bash

没有什么问题:

root@c5c2a21f1466:/notebooks# ls1_hello_tensorflow.ipynb  2_getting_started.ipynb  3_mnist_from_scratch.ipynb  BUILD  LICENSE

root@c5c2a21f1466:/notebooks# python

Python 2.7.12 (default, Dec  4 2017, 14:50:18) [GCC 5.4.0 20160609] on linux2

Type “help”, “copyright”, “credits” or “license” for more information.>>> import tensorflow as tf>>> tf.__version__’1.12.0′>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))-08-15 09:52:29.179406: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA-08-15 09:52:29.376334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:

name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582

pciBusID: 0000:05:00.0

totalMemory: 10.92GiB freeMemory: 6.20GiB-08-15 09:52:29.527131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:

name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582

pciBusID: 0000:06:00.0

totalMemory: 10.92GiB freeMemory: 6.68GiB-08-15 09:52:29.702166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:

name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582

pciBusID: 0000:09:00.0

totalMemory: 10.92GiB freeMemory: 6.67GiB-08-15 09:52:29.846647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:

name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582

pciBusID: 0000:0a:00.0

totalMemory: 10.92GiB freeMemory: 6.68GiB-08-15 09:52:29.855818: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3-08-15 09:52:30.815240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:-08-15 09:52:30.815280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1 2 3

感兴趣的同学还可以尝试Jupyte Notebook版本的镜像,通过端口映射运行,然后就可以通过浏览器测试和学习了。

版权声明:xxxxxxxxx;

原文链接:https://lecms.nxtedu.cn/yunzhuji/113727.html

发表评论:

验证码

管理员

  • 内容1196553
  • 积分0
  • 金币0
关于我们
lecms主程序为免费提供使用,使用者不得将本系统应用于任何形式的非法用途,由此产生的一切法律风险,需由使用者自行承担,与本站和开发者无关。一旦使用lecms,表示您即承认您已阅读、理解并同意受此条款的约束,并遵守所有相应法律和法规。
联系方式
电话:
地址:广东省中山市
Email:admin@qq.com
注册登录
注册帐号
登录帐号

Copyright © 2022 LECMS Inc. 保留所有权利。 Powered by LECMS 3.0.3

页面耗时0.0120秒, 内存占用363.67 KB, 访问数据库18次