So you want to follow the hype and generate some images with Stability AI's shiny new Stable Diffusion 3.5 model (SD 3.5). You find the model on Hugging Face, and hey, there is a code example to try it out. And it works!?

Absolutely not!

Inconveniently, there are quite a few more steps to take and considerations to make before your Python script will generate an AI image, especially on consumer hardware with little VRAM. This article shows how to really do it, even on a laptop GPU with only 6GB of VRAM. It is, in large part, an adapted collection of other material available on the web.

So, let’s get you your:

System Prerequisites

All examples in this article are tested using Endeavour OS on a Thinkpad X1 Extreme Laptop with Intel and Nvidia RTX 3060 Laptop hybrid graphics (6GB VRAM).

Make sure your packages are up to date and that you have the Nvidia drivers and CUDA installed:

yay -Syu
yay -S nvidia-inst cuda

Hugging Face Account and License Considerations

We assume you have a Hugging Face account and can log in via the command line. If not, work through the Hugging Face Hub Quick Start Guide.
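If you want to check from Python that the login works, here is a minimal sketch using the huggingface_hub library (part of the requirements we install below). login() prompts for an access token if you are not logged in already, and whoami() reports the account it belongs to:

from huggingface_hub import login, whoami

# Prompts for an access token if no cached login exists
login()
# Prints the account name the token belongs to
print(whoami()["name"])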

In this article we will use the Stability AI models Stable Diffusion 3.5 Large and Stable Diffusion 3.5 Medium. You need to visit both model pages and accept their respective licenses in order to use them locally.
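To verify that your account can actually access the gated models, you can try to fetch a small file from each repository. A minimal sketch using hf_hub_download; model_index.json is the pipeline index file that diffusers models ship with:

from huggingface_hub import hf_hub_download

# Both calls fail with a gated-repo error if the license was not accepted yet
hf_hub_download("stabilityai/stable-diffusion-3.5-large", "model_index.json")
hf_hub_download("stabilityai/stable-diffusion-3.5-medium", "model_index.json")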

Python and CUDA

In this section we explain how to set up your Python environment so that it can use the CUDA libraries.

Create the Python Environment and Install Dependencies

Most AI image generation apps use slightly older Python versions, most commonly 3.10 or 3.11. If your distro already provides one of these versions as standard, you should be good. If not, don't worry: all examples in this article work with Python 3.13 on Endeavour OS.

We use a fresh virtualenv:

python3.13 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip

For the examples in this article we use the following requirements:

requirements.txt
torch
diffusers
bitsandbytes
transformers[sentencepiece]
huggingface_hub[cli]
protobuf
accelerate
optimum

Install them in the virtual environment:

pip install -r requirements.txt

Get Python Info

The following Python script outputs some information about the installed software versions and the available hardware. The script is based on an answer from StackOverflow by texasdave. If it works and your GPU shows up, the image generation scripts below should be able to use your GPU through Python.

#!/usr/bin/env python

import torch
import sys
import os
from subprocess import call

print('_____Python, Pytorch, Cuda info____')
print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA RUNTIME API VERSION')
os.system('nvcc --version')
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('_____nvidia-smi GPU details____')
call(["nvidia-smi", "--format=csv", "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])
print('_____Device assignments____')
print('Number CUDA Devices:', torch.cuda.device_count())
print('Current cuda device:', torch.cuda.current_device(), ' **May not correspond to nvidia-smi ID above, check visibility parameter')
print('Current device name:', torch.cuda.get_device_name(torch.cuda.current_device()))
available_devices = [f'{i}: {torch.cuda.get_device_name(i)}' for i in range(torch.cuda.device_count())]
print('All available devices:', ', '.join(available_devices))

Generate Images With Stable Diffusion 3.5 and Python

Now that we can use CUDA from Python we can start generating images. Generating images with Stable Diffusion requires a lot of RAM on the graphics card (VRAM). We will explore different setups for cards with less and less VRAM.

Run the Original Example

This is the original image generation snippet from the Hugging Face model page. The example requires a GPU with more than 24GB of VRAM. If you have such a GPU, congratulations. Otherwise, consider the subsequent sections.

#!/usr/bin/env python

import os

import torch
from diffusers import StableDiffusion3Pipeline

# Load the full pipeline in bfloat16 and move it to the GPU
pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

image = pipe(
    "A capybara holding a sign that reads Hello World",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]

# The output directory must exist, otherwise image.save raises an error
os.makedirs("out", exist_ok=True)
image.save("out/capybara.png")

Selecting a GPU

If you have multiple GPUs installed in your system or are using an external GPU case through Thunderbolt, you might have to select the external GPU before running the pipeline. The Python info script should give you the index of your external GPU. Then add the following code snippet before the pipeline run.

# Select GPU, use index from the info script above
pipe = pipe.to('cuda:1')
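Alternatively, you can hide all other GPUs from PyTorch via the CUDA_VISIBLE_DEVICES environment variable. A minimal sketch, assuming your external GPU has CUDA index 1 (the info script above prints the indices):

import os

# Must be set before CUDA is initialized, i.e. before the first CUDA call
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# From here on, 'cuda' refers to the selected GPU, so pipe.to("cuda") works unchanged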

Running a Stable Diffusion 3.5 Variant With Lower VRAM Requirements

If your GPU has less than 24GB of VRAM, the easiest way to get it running is to switch to a smaller variant of Stable Diffusion 3.5. There is a medium-sized model available, called "stabilityai/stable-diffusion-3.5-medium". Simply point the from_pretrained statement from the previous code snippet at the new model:

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16)

Quantize Stable Diffusion 3.5 for Less VRAM Consumption Using Diffusers

If you have even less VRAM available, you can try quantizing the model before using it. Quantization converts the model weights into a lower precision format to save VRAM. The model page already provides an example quantizing to NF4 with bitsandbytes and diffusers:

#!/usr/bin/env python

import torch
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel, StableDiffusion3Pipeline

model_id = "stabilityai/stable-diffusion-3.5-large"

# Load the transformer, the largest part of the model, in 4-bit NF4 precision
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model_nf4 = SD3Transformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16
)

# Build the pipeline around the quantized transformer
pipeline = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    transformer=model_nf4,
    torch_dtype=torch.bfloat16
)
# Keep submodels in system RAM and move each to the GPU only while it is needed
pipeline.enable_model_cpu_offload()

prompt = "A capybara holding a sign that reads Hello World"

image = pipeline(
    prompt=prompt,
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]
image.save("whimsical.png")

Quantize Stable Diffusion 3.5 with Quanto for Even Less VRAM Consumption

But what if you have even less VRAM available, say a laptop GPU with only 6GB of VRAM? Well, even the above approach needs too much memory for such a card. But we can use quanto to convert the model weights to an even lower precision. Following the guide by Paul and Corvoysier on quantization with quanto, we quantize the transformer and the third text encoder of SD 3.5. The lowest usable quantization is qint4; qint2 only generates incoherent images.

#!/usr/bin/env python

import os

import torch
from diffusers import StableDiffusion3Pipeline
from optimum.quanto import freeze, qint4, quantize

model_id = "stabilityai/stable-diffusion-3.5-large"

pipeline = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16
)

# Quantize the transformer and the large T5 text encoder to 4-bit integers,
# then freeze them so the quantized weights are used directly
quantize(pipeline.transformer, weights=qint4)
freeze(pipeline.transformer)

quantize(pipeline.text_encoder_3, weights=qint4)
freeze(pipeline.text_encoder_3)

pipeline.enable_model_cpu_offload()

prompt = "A capybara holding a sign that reads Hello World"

print('Before pipeline run')  # loading takes a while; this marks the start of inference

image = pipeline(
    prompt=prompt,
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]

os.makedirs("out", exist_ok=True)
image.save("out/qint4.png")

The above approach uses just under 5.9GB of VRAM during inference. Loading the model into VRAM takes a long time, and inference itself is also quite slow at 21 seconds per iteration. Using the medium model, the script uses 2.8GB of VRAM during inference and takes around 5 seconds per iteration. Unfortunately, the pipeline again takes a long time to load.
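To verify the peak usage on your own hardware, you can use PyTorch's built-in memory statistics. A minimal sketch to wrap around the pipeline call above; note that it only counts memory allocated by PyTorch itself, so nvidia-smi will usually report slightly more:

import torch

# Reset the peak-memory counter before running the pipeline
torch.cuda.reset_peak_memory_stats()

# ... run the pipeline call from the script above here ...

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM allocated by PyTorch: {peak_gib:.2f} GiB")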

There we go: Stable Diffusion 3.5 Large on an RTX 3060 Laptop GPU.

Troubleshooting

Most likely you will face some issues setting up your image generation. In the following sections we present some tools you can use to get to the bottom of your problems, as well as a Thunderbolt-specific workaround.

Information From Nvidia Tooling

The Nvidia driver package comes with some tools, most notably nvidia-smi, which shows information about the GPUs detected by the system.

# List all GPUs using the Nvidia tools
[benjamin@MrMoon ~]$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU (UUID: GPU-d385c279-8ed2-2321-e417-c6fc3aaed03e)
# Get detailed info
[benjamin@MrMoon ~]$ nvidia-smi
Sat Feb 1 00:43:53 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77 Driver Version: 565.77 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 ... Off | 00000000:01:00.0 On | N/A |
| N/A 49C P8 13W / 80W | 110MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1425 G /usr/lib/Xorg 83MiB |
+-----------------------------------------------------------------------------------------+

System Information

If you don't get information or error messages from nvidia-smi, or are unsure whether your GPU is detected by the system, consider the following commands to gather more information about what went wrong:

# Check firmware for first GPU
cat /proc/driver/nvidia/gpus/0000:01:00.0/information
# Directory with all Nvidia GPUs
ls /proc/driver/nvidia/gpus/
# List discovered GPUs
sudo lspci -d ::03xx
# List discovered GPUs with drivers
sudo lspci -k -d ::03xx
# Show system logs since boot
sudo journalctl -b
# Show kernel messages
sudo dmesg -H

Additional Tools

The additional tool nvtop shows GPU processor and memory usage.

yay -S nvtop
nvtop

Alternatively, use a watch on nvidia-smi:

watch -d -n 0.5 nvidia-smi

Thunderbolt Power Management Issues

When using external GPUs via Thunderbolt, you might experience behaviour where the external GPU disappears after a few seconds or minutes of working correctly. A reboot fixes the problem, but the GPU will disappear again after a while. The reason could be a problem with PCIe power management over Thunderbolt. Check the journalctl output for log messages like these:

Jan 26 16:03:09 MrMoon kernel: pcieport 0000:00:1d.0: AER: Correctable error message received from 0000:58:01.0
Jan 26 16:03:09 MrMoon kernel: pcieport 0000:58:01.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
Jan 26 16:03:09 MrMoon kernel: pcieport 0000:58:01.0: device [8086:15da] error status/mask=00000080/00002000
Jan 26 16:03:09 MrMoon kernel: pcieport 0000:58:01.0: [ 7] BadDLLP

And dmesg might show something like this:

Jan 26 16:20:08 MrMoon kernel: pcieport 0000:20:00.0: AER: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
Jan 26 16:20:08 MrMoon kernel: thunderbolt 0000:22:00.0: AER: can't recover (no error_detected callback)
Jan 26 16:20:09 MrMoon kernel: xhci_hcd 0000:56:00.0: AER: can't recover (no error_detected callback)
Jan 26 16:20:09 MrMoon kernel: nvidia 0000:59:00.0: AER: can't recover (no error_detected callback)
Jan 26 16:20:09 MrMoon kernel: snd_hda_intel 0000:59:00.1: AER: can't recover (no error_detected callback)
Jan 26 16:20:09 MrMoon kernel: pcieport 0000:00:1d.0: unlocked secondary bus reset via: pciehp_reset_slot+0x98/0x140
Jan 26 16:20:09 MrMoon kernel: NVRM: GPU at PCI:0000:59:00: GPU-a971ee5e-7538-ebea-5286-11eeba88666c
Jan 26 16:20:09 MrMoon kernel: NVRM: Xid (PCI:0000:59:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Jan 26 16:20:09 MrMoon kernel: NVRM: GPU 0000:59:00.0: GPU has fallen off the bus.

Adding pcie_aspm=off to the kernel boot parameters fixed the problem for us. This setting disables PCIe Active State Power Management and might therefore increase power draw. But when the use case is generating images, the difference should be negligible.

Conclusion

The code snippets on Hugging Face really could use more explanation. We hope you now have a working setup. Have fun generating images!