CUDA suddenly unavailable? Maybe it’s not your fault

When something breaks in our Python environments, we tend to assume it is our fault, but that’s not always the case!

It was a Monday morning. I turned on my VM instance on Google Cloud Platform (GCP) and started my deep learning experiments. I noticed that the experiment was running on the CPU, although I expected it to run on the GPU. I did a quick test and got the following:

>>> import torch
>>> torch.cuda.is_available()
/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1614378098133/work/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False

Why can’t PyTorch detect CUDA anymore? I was so confused since everything worked perfectly fine before.
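
In hindsight, a small guard at the top of the training script would have made the problem obvious right away, instead of letting the run quietly fall back to the CPU. Here is a minimal sketch of that idea. The function name pick_device and its parameters are hypothetical (they are not part of PyTorch); the only PyTorch call used is torch.cuda.is_available().

```python
def pick_device(require_gpu=True, cuda_available=None):
    """Return "cuda" or "cpu", raising instead of silently falling back to CPU.

    `cuda_available` is a hypothetical hook: a zero-argument callable that
    reports whether CUDA is usable. It defaults to torch.cuda.is_available.
    """
    if cuda_available is None:
        # Import lazily so this helper can be loaded on machines without PyTorch.
        import torch
        cuda_available = torch.cuda.is_available

    if cuda_available():
        return "cuda"
    if require_gpu:
        raise RuntimeError(
            "CUDA is not available; refusing to run this experiment on CPU."
        )
    return "cpu"
```

Calling pick_device() before building the model turns a slow, confusing CPU run into an immediate error; passing require_gpu=False keeps the old fallback behaviour when that is what you want.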

As programmers, we search Google and Stack Overflow every time we get an error message. Most answers (like here and here) suggested that this could be due to an incorrect installation of PyTorch, the NVIDIA driver, or CUDA. I thought: maybe the instance had been updated and some packages were suddenly no longer compatible with each other? I created some new conda environments and installed the related packages from scratch, but still got the error.

Had there been any updates to the instance? I checked the instance’s logs and didn’t find any update activity.

I found this nice blog post on how to troubleshoot CUDA issues on cloud instances. It helps you diagnose the problem at different levels. It also suggests that automatic updates on GCP instances may screw things up, and shows how to turn off automatic updates (I’m not sure if that’s a good idea). I tried the approaches in this blog post but still couldn’t solve my problem…
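
To give a feel for what diagnosing "at different levels" can look like, here is a rough sketch of my own (it is not taken from that blog post). It checks the driver level by running nvidia-smi, then the PyTorch level using torch.version.cuda, torch.cuda.is_available() and torch.cuda.device_count(). The function name diagnose_cuda and the report keys are made up for illustration.

```python
import shutil
import subprocess


def diagnose_cuda():
    """Collect GPU facts from the driver level up to the PyTorch level."""
    report = {}

    # Level 1: is the NVIDIA driver installed and responding?
    smi = shutil.which("nvidia-smi")
    if smi is None:
        report["driver"] = "nvidia-smi not found on PATH"
    else:
        try:
            result = subprocess.run(
                [smi], capture_output=True, text=True, timeout=30
            )
            if result.returncode == 0:
                report["driver"] = "ok"
            else:
                # Keep whatever message the tool printed for later inspection.
                report["driver"] = result.stderr.strip() or result.stdout.strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            report["driver"] = f"nvidia-smi failed to run: {exc}"

    # Level 2: which PyTorch build is installed, and was it built with CUDA?
    try:
        import torch
    except ImportError:
        report["torch"] = "not installed"
        return report
    report["torch"] = torch.__version__
    report["torch_built_with_cuda"] = torch.version.cuda  # None for CPU-only builds

    # Level 3: can PyTorch actually see a GPU at runtime?
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in diagnose_cuda().items():
        print(f"{key}: {value}")
```

If the driver level already fails, reinstalling PyTorch or recreating conda environments is unlikely to help, which would have saved me some time.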

I was clueless and kept browsing Stack Overflow, where I found this very recent question. It seemed that other users were experiencing the same issue. The thread led me to Google’s issue tracker, a platform I had not been aware of before!

Someone at Google had opened this issue, and it described exactly what I was dealing with. I was relieved, because this turned out to be a common GCP issue.

Later, another Google engineer posted the fix.

I browsed through other topics on the issue tracker and found issues related to other GCP services as well. Now I know that this will be one of my go-to places when something goes wrong on GCP.

Takeaway: when something doesn’t work properly on a GCP instance, check the issue tracker in addition to Stack Overflow. Maybe it’s a common problem!