GPU Issues
Troubleshooting CUDA, cuDNN, ONNX Runtime, and GPU memory problems.
GPU issues are the most common category of problems reported by Recaster users. This page covers diagnosis and solutions for NVIDIA GPU acceleration, driver compatibility, and platform-specific GPU limitations.
GPU Not Detected
If Recaster does not detect your GPU and falls back to CPU processing:
- Update NVIDIA drivers — Download the latest drivers from nvidia.com/drivers. Driver version 535 or later is recommended.
- Check VRAM — Ensure your GPU has at least 2 GB VRAM for face swapping and 4 GB for training.
- Verify NVIDIA GPU — Recaster only supports NVIDIA GPUs for CUDA acceleration. AMD and Intel GPUs are not supported (CPU fallback is used).
- Restart after driver update — A full system restart is required after installing new GPU drivers.
Quick Verification
Run nvidia-smi in your terminal to verify your GPU is recognized by the NVIDIA driver. If this command fails, the driver is not installed correctly.
CUDA Errors
CUDA_ERROR_INVALID_PTX
This error indicates a GPU/CUDA version mismatch. The compiled PTX code is not compatible with your GPU architecture.
- RTX 5000 series: This is expected. RTX 5000 GPUs require CUDA 12.8+ with compute capability 12.0, but current TensorFlow wheels only support up to compute capability 9.0. See the RTX 5000 Workarounds section below.
- Other GPUs: Update your NVIDIA driver to the latest version. This usually resolves the PTX mismatch for supported GPU architectures.
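To see which architecture your GPU actually reports, you can ask the driver directly. The sketch below is a hedged example, not part of Recaster: it assumes the `compute_cap` query field supported by recent nvidia-smi releases, and the 9.0 threshold mirrors the TensorFlow wheel limit described above.

```python
import shutil
import subprocess

def parse_compute_caps(csv_text: str) -> list[tuple[str, float]]:
    """Parse 'name, compute_cap' CSV rows from nvidia-smi into (name, capability)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # rsplit guards against commas inside the GPU name itself.
        name, cap = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, float(cap)))
    return gpus

# Query the driver only when nvidia-smi is actually installed.
if shutil.which("nvidia-smi"):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    for name, cap in parse_compute_caps(out.stdout):
        flag = " (above current TensorFlow wheel support)" if cap > 9.0 else ""
        print(f"{name}: compute capability {cap}{flag}")
```

On an RTX 5000 series card this reports compute capability 12.0, which is the mismatch behind the PTX error.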
CUDA Out of Memory
If you see CUDA out-of-memory errors during processing:
- Close other GPU-intensive applications (games, other AI tools, video editors).
- Reduce tile size in upscaling settings from 512 to 256 or 128.
- Use smaller model variants (InSwapper 128-FP16 instead of Ghost 3).
- Lower the GPU memory limit in advanced settings.
- Process lower-resolution input files.
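Why the tile-size change helps the most: per-step memory grows with the square of the tile edge, so halving the tile cuts working memory to roughly a quarter. The estimate below is illustrative only; the 4x activation-overhead factor and 4x scale are assumptions for the sketch, not measured Recaster values.

```python
def tile_vram_mb(tile: int, channels: int = 3, bytes_per_value: int = 4,
                 scale: int = 4, overhead: float = 4.0) -> float:
    """Rough VRAM (MB) for one upscaling tile: input plus upscaled output,
    multiplied by an assumed overhead factor for intermediate feature maps."""
    in_bytes = tile * tile * channels * bytes_per_value
    out_bytes = (tile * scale) ** 2 * channels * bytes_per_value
    return (in_bytes + out_bytes) * overhead / 1024**2

for t in (512, 256, 128):
    print(f"tile {t}: ~{tile_vram_mb(t):.0f} MB per step")
```

Since every term scales with tile², dropping from 512 to 256 divides the per-step footprint by exactly four under this model.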
cuFuncGetName Errors
This error typically appears on remote instances and indicates an ONNX Runtime or cuDNN version mismatch. The fix is to reinstall the correct package versions:
pip install --force-reinstall onnxruntime-gpu==1.19.2
pip install --force-reinstall "numpy>=1.23.0,<2.0.0"
pip install nvidia-cudnn-cu12==9.1.0.70
cuDNN Installation
cuDNN (CUDA Deep Neural Network library) is required for GPU acceleration on remote instances. If you see "libcudnn.so.9 not found" errors:
- Install cuDNN 9 via pip:
pip install nvidia-cudnn-cu12==9.1.0.70
- Find the cuDNN library path:
python -c "import nvidia.cudnn; print(nvidia.cudnn.__path__[0] + '/lib')"
- Add the path to LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=$(python -c "import nvidia.cudnn; print(nvidia.cudnn.__path__[0] + '/lib')"):$LD_LIBRARY_PATH
- Verify the CUDA provider is available:
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
You should see CUDAExecutionProvider in the output. If only CPUExecutionProvider appears, the cuDNN setup is not correct.
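If the providers look right but behavior is still wrong, also confirm that the version pins from the pip commands above actually took effect. This is a stdlib-only sketch; the ranges simply mirror the pins used in this guide.

```python
from importlib.metadata import PackageNotFoundError, version

def in_range(ver: str, lo: tuple, hi: tuple) -> bool:
    """True if lo <= ver < hi, comparing the leading numeric dotted components."""
    parts = tuple(int(p) for p in ver.split(".")[:len(hi)] if p.isdigit())
    return lo <= parts < hi

# Pins from this guide: numpy >=1.23,<2.0 and onnxruntime-gpu ==1.19.2.
checks = {
    "numpy": lambda v: in_range(v, (1, 23), (2, 0)),
    "onnxruntime-gpu": lambda v: v == "1.19.2",
}
for pkg, ok in checks.items():
    try:
        installed = version(pkg)
        print(f"{pkg} {installed}: {'OK' if ok(installed) else 'WRONG VERSION'}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```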
Automatic Fallback
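ONNX Runtime tries providers in the order you list them and silently falls back to the next one when a provider cannot initialize, which is why a broken cuDNN setup shows up as slow CPU-only processing rather than an error. A hedged sketch of an explicit priority list (`model.onnx` is a placeholder path; `gpu_mem_limit` is a real CUDAExecutionProvider option, expressed in bytes):

```python
# Providers are tried in order; CPU is the last-resort fallback.
providers = [
    ("CUDAExecutionProvider", {"gpu_mem_limit": 4 * 1024**3}),  # optional 4 GB VRAM cap
    "CPUExecutionProvider",
]

try:
    import onnxruntime as ort
    # Placeholder model path for illustration.
    session = ort.InferenceSession("model.onnx", providers=providers)
    # Shows which providers were actually applied after fallback.
    print(session.get_providers())
except Exception as exc:  # onnxruntime missing, or model not found
    print(f"session not created: {exc}")
```

If `get_providers()` prints only `['CPUExecutionProvider']`, CUDA initialization failed and the cuDNN steps above need revisiting.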
ONNX Runtime Compatibility
Critical Version Pinning
If face processing hangs, shows 0% GPU utilization, or produces garbage output, you likely have an incompatible ONNX Runtime version. Force-reinstall the correct versions:
pip install --force-reinstall onnxruntime-gpu==1.19.2
pip install --force-reinstall "numpy>=1.23.0,<2.0.0"
pip uninstall -y opencv-python-headless
pip install --no-deps "opencv-python>=4.5.0,<4.11.0"
NumPy Version Mismatch
NumPy 2.0+ breaks ONNX Runtime 1.19.2. If you see import errors or unexpected behavior, downgrade NumPy:
pip install --force-reinstall "numpy>=1.23.0,<2.0.0"
OpenCV Headless Conflict
Some packages install opencv-python-headless which conflicts with opencv-python. Recaster requires the non-headless version:
pip uninstall -y opencv-python-headless
pip install --no-deps "opencv-python>=4.5.0,<4.11.0"
RTX 5000 Series Workarounds
Blackwell Architecture Limitations
Available workarounds for RTX 5000 series users:
1. Remote Training (Recommended)
Use Studio tier cloud GPUs via Vast.ai. Works immediately with no local GPU constraints. Cloud instances use RTX 3090/4090/A100 GPUs that have full CUDA support.
2. WSL2 on Windows
Use Windows Subsystem for Linux 2 with a community fork of DeepFaceLab that supports Blackwell architecture. This runs natively on your GPU through WSL2's GPU passthrough.
3. ONNX Runtime Operations
Quick Recast (face swapping/enhancement) and video upscaling use ONNX Runtime instead of TensorFlow. These operations may work on RTX 5000 GPUs since ONNX Runtime updates CUDA support more frequently.
SwinIR CoreML Issues (macOS)
On macOS, SwinIR models cannot use CoreML acceleration due to Apple's framework not supporting SwinIR's dynamic input shapes. You may see errors like:
- "CoreML does not support shapes with dimension values of 0"
- "runtime shape has zero elements"
This is a limitation of Apple's CoreML framework, not a Recaster bug. SwinIR automatically falls back to CPU processing on macOS, which is slower (2-5 FPS vs 10-20 FPS with CoreML).
macOS Recommendation
GPU Memory Management
If you are running low on GPU memory, try these optimizations:
- Reduce tile size: In upscaling settings, lower the tile size from 512 to 256 or 128. Smaller tiles use less VRAM per processing step.
- Lower batch size: If processing multiple faces, smaller batch sizes reduce peak VRAM usage.
- Close other GPU applications: Games, video editors, other AI tools, and even some web browsers use GPU memory.
- Set GPU memory limit: In advanced upscaling settings, set a GPU memory limit (1/2/4/6/8 GB) to prevent ONNX Runtime from consuming all available VRAM.
- Use FP16 models: Half-precision models (e.g. InSwapper 128-FP16) use approximately half the VRAM of full-precision equivalents.
Monitoring GPU Utilization
To monitor GPU usage during processing:
# Watch GPU usage in real-time (updates every 1 second)
nvidia-smi -l 1
# Or use watch for a cleaner display
watch -n 1 nvidia-smi
If GPU utilization shows 0% during processing, the GPU provider is not being used. Check the ONNX Runtime provider configuration and cuDNN installation as described above.
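The same check can be scripted to flag the 0%-utilization case automatically. The parser below is a hedged sketch around the standard `utilization.gpu` query field; the polling loop only runs when nvidia-smi is present.

```python
import shutil
import subprocess
import time

def parse_utilization(csv_text: str) -> list[int]:
    """Parse 'utilization.gpu' percentages from nvidia-smi CSV output, e.g. '87 %'."""
    return [int(line.strip().rstrip("%").strip())
            for line in csv_text.strip().splitlines() if line.strip()]

# Poll once a second while processing runs (requires the NVIDIA driver).
if shutil.which("nvidia-smi"):
    for _ in range(5):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        # A reading stuck at 0 means the GPU provider is idle.
        print(parse_utilization(out.stdout))
        time.sleep(1)
```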