The lab just got a new server with an NVIDIA GTX 1080 Ti graphics card for deep learning. The first task after setting up the machine was configuring the TensorFlow deep learning environment. Here, I’ll document how I set up the environment and the issues I ran into, hoping to help others with similar needs.
The operating system is Ubuntu 18.04.1 LTS, prepared by a senior lab member (you can check with lsb_release -a). Of course, the first step is to Google it. I found an article: setup-an-environment-for-machine-learning-and-deep-learning-with-anaconda-in-windows. The next step is to follow the instructions.
The setup process is divided into 6 steps:
- Download Anaconda
- Install Anaconda & Python (Ubuntu comes with Python 3)
- Update Anaconda
- Install CUDA & cuDNN (cuDNN often does not need to be installed manually)
- Install TensorFlow & Keras
- CUDA version switching (not needed if installed correctly)
Steps 1, 2, and 3 are standard and relatively simple. I’ll briefly explain them.
1. Download Anaconda
Go to the Anaconda website and download the Linux installer for Python 3.7, which is the version I used on this Ubuntu 18.04 server.
2. Install Anaconda & Python (Ubuntu comes with Python 3)
bash Anaconda3-5.3.1-Linux-x86_64.sh
Follow the prompts to complete the installation. If you use zsh as your default shell, like I do, you’ll need to copy the block that the Anaconda installer automatically appends to .bashrc during installation.
Paste it at the end of your .zshrc file, then run source ~/.zshrc to activate it. Type conda in the terminal; if you see the help message, it’s installed.
3. Update Anaconda
Enter the following commands in the terminal to update conda:
conda update conda
conda update --all
4. Install CUDA & cuDNN
The key part is the installation of CUDA and cuDNN. I had a lot of trouble with this step. Here, I’ll tell you the correct way.
First, go to the NVIDIA Developer website and download the CUDA 9.0 installer for your system. It’s crucial to note that the TensorFlow GPU builds available at the time of writing only work with CUDA 9.0 (emphasis added; the official website defaults to version 10.0, which caused me a lot of trouble).
Once CUDA is downloaded, it’s simple:
sudo dpkg -i cuda-repo-ubuntu1704-9-0-local_9.0.176-1_amd64-deb
sudo apt-key add /var/cuda-repo-9-0-local/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda
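Before going further, it can be worth confirming that the toolkit that ended up on disk really is 9.0. The following is only a minimal sketch, not something from my original setup: it parses the text printed by nvcc --version and pulls out the release number. The helper names are made up for illustration, and the nvcc path assumes the default /usr/local/cuda-9.0 install location.

```python
import re
import subprocess


def parse_nvcc_release(nvcc_output):
    """Return the CUDA release (e.g. '9.0') found in `nvcc --version` text, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def installed_cuda_release(nvcc_path="/usr/local/cuda-9.0/bin/nvcc"):
    """Run nvcc and return its release string; None if nvcc is missing or unparsable."""
    try:
        output = subprocess.check_output(
            [nvcc_path, "--version"], universal_newlines=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(output)


if __name__ == "__main__":
    release = installed_cuda_release()
    print("CUDA release:", release)
    if release != "9.0":
        print("Expected 9.0; TensorFlow may fail to load its CUDA libraries.")
```

If this reports None, nvcc is probably not at the assumed path; pass the real location to installed_cuda_release.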
After installation, you need to downgrade the GCC version and add environment variables. Use the following commands:
sudo apt install gcc-5 g++-5
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 50
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-5 50
export CUDA_HOME=/usr/local/cuda-9.0  # toolkit root, not its bin directory
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin  # only executable directories belong in PATH
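A mistyped export is easy to miss, so here is a small sketch that checks whether the CUDA directories actually ended up in the environment. It is an illustration under assumptions rather than an official tool: check_cuda_env is a hypothetical helper, and the default cuda_root assumes the /usr/local/cuda-9.0 location used above.

```python
import os


def check_cuda_env(env, cuda_root="/usr/local/cuda-9.0"):
    """Return a list of problems found in PATH / LD_LIBRARY_PATH for cuda_root."""
    # Split the colon-separated variables and ignore trailing slashes.
    path_dirs = [d.rstrip("/") for d in env.get("PATH", "").split(os.pathsep)]
    lib_dirs = [d.rstrip("/") for d in env.get("LD_LIBRARY_PATH", "").split(os.pathsep)]

    problems = []
    bin_dir = os.path.join(cuda_root, "bin")
    lib_dir = os.path.join(cuda_root, "lib64")
    if bin_dir not in path_dirs:
        problems.append("PATH does not contain " + bin_dir)
    if lib_dir not in lib_dirs:
        problems.append("LD_LIBRARY_PATH does not contain " + lib_dir)
    return problems


if __name__ == "__main__":
    issues = check_cuda_env(os.environ)
    print("\n".join(issues) if issues else "CUDA paths look fine")
```

Keep in mind that export only affects the current shell session. To make the variables permanent, add the export lines to ~/.bashrc (or ~/.zshrc if you use zsh).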
cuDNN Installation
You need to register an account on NVIDIA Developer, then download the runtime and developer .deb packages and install them with dpkg:
sudo dpkg -i libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb
sudo dpkg -i libcudnn7-dev_7.0.5.15-1+cuda9.0_amd64.deb
sudo dpkg -i libcudnn7-doc_7.0.5.15-1+cuda9.0_amd64.deb # Optional
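To double-check which cuDNN version was installed, one option is to read the version macros from the cuDNN header. This is a rough sketch under assumptions: parse_cudnn_version is a hypothetical helper, and /usr/include/cudnn.h is where I would expect the developer package to put the header; adjust the path if yours differs.

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the CUDNN_* #define lines of cudnn.h."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        pattern = r"^\s*#define\s+" + name + r"\s+(\d+)"
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            return None  # header is incomplete or not a cuDNN header
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    # Assumed header location for a .deb install.
    header_path = "/usr/include/cudnn.h"
    try:
        with open(header_path) as header:
            print("cuDNN version:", parse_cudnn_version(header.read()))
    except OSError:
        print("Could not read", header_path)
```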
5. Install TensorFlow & Keras
You can directly install them using Anaconda:
conda install -c anaconda tensorflow-gpu
conda install -c conda-forge keras-gpu
The -gpu suffix indicates the GPU version. Without it, you’d install the CPU version. Since the server has a graphics card, it’s better to download the GPU version to leverage its advantages.
6. CUDA Version Switching (not needed if installed correctly)
Use the following command to check if the TensorFlow environment is set up:
python -c "import tensorflow as tf;"
If the import fails with an error saying that libcublas.so.9.0 cannot be found, there’s a CUDA issue. After searching, I found that libcublas.so.9.0 is part of CUDA 9.0, and it was missing because I had initially installed the latest CUDA 10, which was a major pitfall.
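To confirm that a missing library really is the cause, you can search the library directories for the file yourself. The snippet below is a minimal sketch, not a replacement for the dynamic loader’s own lookup: find_shared_library is a hypothetical helper, and the extra /usr/local/cuda-9.0/lib64 directory is an assumption based on the default install location.

```python
import os


def find_shared_library(filename, search_dirs):
    """Return the full paths inside search_dirs where filename exists."""
    hits = []
    for directory in search_dirs:
        candidate = os.path.join(directory, filename)
        if os.path.exists(candidate):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    # Directories listed in LD_LIBRARY_PATH, plus the assumed CUDA 9.0 location.
    dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep) if d]
    dirs.append("/usr/local/cuda-9.0/lib64")
    found = find_shared_library("libcublas.so.9.0", dirs)
    print(found if found else "libcublas.so.9.0 not found")
```

If nothing is found but a libcublas.so.10.0 exists under /usr/local, you are most likely in the same situation I was in.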
To fix this, uninstall CUDA 10:
sudo apt-get remove 'cuda*'
sudo apt remove --purge 'nvidia*'  # note: this also removes the NVIDIA driver packages
sudo rm /etc/apt/sources.list.d/cuda-10-0-local-10.0.130-410.48.list
sudo apt autoremove
sudo rm -rf /var/cuda-repo-10-0-local-10.0.130-410.48
sudo rm -rf /usr/local/cuda*
Run all these commands, and then reinstall as described in step 4. After that, running python -c "import tensorflow as tf;" should no longer show errors. Next, use the following to verify the successful installation of the TensorFlow GPU version. You should see that the 1080 Ti has been successfully recognized by TensorFlow.
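from tensorflow.python.client import device_lib
# List every device TensorFlow can see; a working setup includes a GPU entry
local_device_protos = device_lib.list_local_devices()
print(local_device_protos)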
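The full device list is fairly verbose. If you only care whether a GPU is visible, a small filter like the one below can help. This is a sketch of my own, with gpu_device_names as a hypothetical helper; it relies only on the name and device_type fields of the entries returned by list_local_devices().

```python
def gpu_device_names(devices):
    """Return the names of all entries whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


if __name__ == "__main__":
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        print("TensorFlow is not importable in this environment")
    else:
        names = gpu_device_names(device_lib.list_local_devices())
        print(names if names else "No GPU visible to TensorFlow")
```

An empty result usually means either the CPU-only package was installed or the CUDA libraries could not be loaded.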
Summary
This guide covered the entire process of setting up a TensorFlow deep learning environment on Ubuntu, from Anaconda installation to configuring CUDA and TensorFlow-GPU. The most significant hurdle is often CUDA version compatibility: make sure the installed CUDA version matches what your TensorFlow GPU build expects (CUDA 9.0 in my case), because a newer toolkit such as CUDA 10 will not provide the libraries it looks for. By following these steps, you should have your deep learning server up and running smoothly.