Useful PyTorch tricks

2 minute read | 08-04-2021

PyTorch is one of the two major machine learning libraries for implementing deep neural networks. PyTorch errors are not always informative, and it is frustrating to have to debug them by searching extensively through Stack Overflow and GitHub issues. This post is a collection of common issues and how to resolve them. I will update it periodically as I hit new issues.

Cuda capability sm_86 not compatible

When you install PyTorch with pip inside a Python virtual environment, the default wheel is built against CUDA 10, which does not support newer GPUs that require sm_86. You will be greeted with the following error:

RTX A5000 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the RTX A5000 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

You can resolve this by explicitly requesting a CUDA 11 build when installing with pip:

pip install --upgrade pip setuptools wheel
pip install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
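
To confirm that an install actually targets your GPU, you can compare the device's compute capability with the architectures the wheel was built for. The sketch below is a simplified check, assuming a build is usable when it ships kernels for the same major architecture as the GPU. The helper name is_arch_supported is made up for illustration; torch.cuda.get_device_capability and torch.cuda.get_arch_list are existing PyTorch calls.

```python
try:
    import torch
except ImportError:  # lets the helper below be used without PyTorch installed
    torch = None


def is_arch_supported(capability, arch_list):
    """Rough check (assumption): a GPU is usable if the build ships
    kernels for the same major compute capability, e.g. sm_80 for sm_86."""
    major, _minor = capability
    built_for = [int(a.split("_")[1]) for a in arch_list if a.startswith("sm_")]
    return any(sm // 10 == major for sm in built_for)


if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)   # e.g. (8, 6) for sm_86
    archs = torch.cuda.get_arch_list()          # e.g. ['sm_37', 'sm_50', ...]
    print("CUDA version of this build:", torch.version.cuda)
    print("GPU capability:", cap, "supported:", is_arch_supported(cap, archs))
```

With the architecture list from the error message above, is_arch_supported((8, 6), ...) returns False, which matches the incompatibility PyTorch reports.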

PyTorch Lightning multi-GPU DDP killed

PyTorch Lightning is a nifty library that makes PyTorch easier to use: it takes care of hardware handling, training-loop code and so on, letting you focus on your research idea. For multi-GPU training, Distributed Data Parallel (DDP) is faster than Data Parallel (DP).

In PyTorch Lightning, when using multiple GPUs, you specify the accelerator as either dp or ddp. With ddp, training sometimes fails to start and the process dies with only the message Killed, without any error or traceback. This seems to be common on Kubernetes clusters (I discovered it on the Nautilus cluster used by UCSD).

To fix this, set auto_select_gpus=True in the Trainer configuration. For example:

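from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin
# Let Lightning pick free GPUs itself and run DDP across them
trainer = Trainer(
    gpus=2, auto_select_gpus=True, plugins=DDPPlugin(find_unused_parameters=False)
)

Finding out which device a tensor is on

x.get_device()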
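
Note that get_device() only gives the integer index of the GPU holding the tensor. As a small sketch, the device attribute (also part of the standard tensor API) works for both CPU and GPU tensors and is handy for creating new tensors in the same place. The variable names are illustrative.

```python
import torch

x = torch.zeros(3)            # created on the CPU by default
print(x.device)               # torch.device object describing where x lives
print(x.is_cuda)              # True only for tensors stored on a GPU

if torch.cuda.is_available():
    x = x.to("cuda")          # move to the current GPU
    print(x.get_device())     # integer index of that GPU

# Create a second tensor on whatever device x ended up on
y = torch.ones(3, device=x.device)
```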

Clearing GPU cache

torch.cuda.empty_cache()
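
It helps to know what this call does and does not free. As a rough sketch, assuming a CUDA-capable machine: empty_cache() releases cached blocks that PyTorch's allocator is holding but that no tensor uses any more, so references to tensors need to be dropped first. torch.cuda.memory_reserved is an existing PyTorch call; the function name clear_and_report and the tensor size are made up for illustration.

```python
import torch


def clear_and_report():
    """Return (reserved_before, reserved_after) in bytes, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    x = torch.empty(1024, 1024, device="cuda")   # about 4 MB of float32
    del x                                         # drop the only reference
    before = torch.cuda.memory_reserved()         # still held in PyTorch's cache
    torch.cuda.empty_cache()                      # hand unused blocks back to the driver
    after = torch.cuda.memory_reserved()
    return before, after


print(clear_and_report())
```

Memory that is still referenced by live tensors is not released, so this will not rescue a run that genuinely needs more memory than the GPU has.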

NumPy to tensor: expected Double but got Float

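You would expect converting a NumPy array to a tensor to be simple, but the dtypes can bite you. NumPy creates floating-point arrays as float64 (np.double) by default, which becomes torch.double after conversion, whereas PyTorch layers use torch.float (float32) weights by default. Feeding such a tensor to a model raises the Double/Float mismatch error. To avoid it, cast the array to np.single (float32) before converting: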

a = np.array([1,2,3]).astype(np.single)
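
To see where the mismatch comes from, here is a small sketch (the array values and variable names are arbitrary): torch.from_numpy keeps the NumPy dtype, so a default float64 array becomes a torch.double tensor unless it is cast first. Calling .float() on the tensor is an alternative to casting on the NumPy side.

```python
import numpy as np
import torch

a64 = np.array([1.0, 2.0, 3.0])            # NumPy defaults to float64 (np.double)
a32 = a64.astype(np.single)                # cast to float32 before converting

t64 = torch.from_numpy(a64)
t32 = torch.from_numpy(a32)
print(t64.dtype, t32.dtype)                # torch.float64 torch.float32

# Alternative: convert first, then cast on the PyTorch side
t32_alt = torch.from_numpy(a64).float()
```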
