Large Language Models (LLMs), often simply called AI, are widely adopted by now. Most of us use them daily for work, leisure and private matters. That also means we share a lot of potentially sensitive information with the LLM. This can be problematic when using third-party services like ChatGPT, especially now that OpenAI has to retain all chat logs indefinitely. So how about running an LLM on your own hardware?

We previously measured two popular current-generation LLMs, Llama 3 and DeepSeek-R1. This time we follow up with Qwen3 and Gemma3, both released in the last few months. We want to give you an idea of which kind of model will run on what hardware and what performance you can expect.

Test Setup

We test two of the currently popular LLM families, Qwen3 and Gemma3, in quantized and 16-bit floating point (fp16) versions for their performance on consumer hardware.
We run the models with ollama 0.9.2 and Open WebUI as the frontend and collect the measured response token rates after executing the following query:

I need a summary of the book “War and Peace”. Please write at least 500 words.

For each test, the model fits into the memory of the graphics card (VRAM). Mobile devices are allowed to cool down before each test run.

For Qwen3, the thinking mode was left at default, which is “on”.
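For readers who want to reproduce the measurement: ollama exposes an HTTP API that returns generation statistics with each response. The following is a minimal sketch, not our exact script, assuming the default endpoint on localhost:11434 and the eval_count/eval_duration fields documented for the /api/generate endpoint:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # ollama's default local endpoint
PROMPT = 'I need a summary of the book "War and Peace". Please write at least 500 words.'

def tokens_per_second(model: str) -> float:
    """Run the benchmark prompt once and derive the response token rate."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=3600,  # large models on slow hardware can take a while
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = number of generated tokens, eval_duration = generation time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

if __name__ == "__main__":
    for model in ("qwen3:8b", "gemma3:12b"):  # example tags; substitute the variants from Table 1
        print(f"{model}: {tokens_per_second(model):.1f} t/s")
```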

Tested Models

In this article we focus on two of the currently popular LLMs:

  • Qwen3, developed by Alibaba, is optimized for deployment on Alibaba Cloud infrastructure, leveraging GPUs and specialized AI accelerators. It excels in advanced conversational AI, content generation, multilingual interactions, and complex reasoning tasks. Typical applications include customer support automation, virtual assistants, translation, summarization, and sentiment analysis. Its scalable architecture ensures efficient integration into cloud-based business solutions, particularly within Alibaba’s ecosystem.

  • Gemma3, created by Google DeepMind, is designed for versatility and efficiency across diverse hardware platforms, including consumer-grade GPUs, CPUs, and edge devices. It effectively handles tasks such as text generation, summarization, conversational AI, and question-answering. Its lightweight, open-source architecture makes it ideal for resource-constrained environments, enabling applications like personal assistants, educational tools, and interactive chatbots. Gemma3’s open-source nature encourages customization, experimentation, and broad adoption in both research and industry contexts.

Model Variant | Precision | VRAM Size
qwen3:0.6b | Q4_K_M | 2.2GB
qwen3:4b | Q4_K_M | 5.2GB
qwen3:8b | Q4_K_M | 7.5GB
qwen3:14b | Q4_K_M | 12GB
qwen3:32b | Q4_K_M | 25GB
qwen3:0.6b | fp16 | 3GB
qwen3:1.7b | fp16 | 5.3GB
qwen3:4b | fp16 | 10GB
qwen3:8b | fp16 | 18GB
qwen3:14b | fp16 | 32GB
gemma3:1b instruct | Q4_K_M | 1.9GB
gemma3:4b instruct | Q4_K_M | 6GB
gemma3:12b instruct | Q4_K_M | 11GB
gemma3:27b instruct | Q4_K_M | 21GB
gemma3:1b instruct, quantization aware | Q4_K_M | 2.1GB
gemma3:4b instruct, quantization aware | Q4_K_M | 6.6GB
gemma3:12b instruct, quantization aware | Q4_K_M | 12GB
gemma3:27b instruct, quantization aware | Q4_K_M | 22GB
gemma3:1b instruct | fp16 | 3.1GB
gemma3:4b instruct | fp16 | 11GB
gemma3:12b instruct | fp16 | 31GB
gemma3:27b instruct | fp16 | 63GB
Table 1: Tested LLMs

Table 1 shows the tested LLMs. Ollama uses quantized models by default; if you specify a more detailed tag, you can choose different quantizations as well as the non-quantized floating point versions. We select Qwen3 and Gemma3 for their popularity, start with the smallest variants parameter-wise and then choose progressively larger models until they no longer fit into the VRAM of any of the test systems.
The last column of the table shows the size of each model when loaded into VRAM, as reported by ollama ps. We can already see that not all models will fit into the VRAM of the test systems from Table 2.
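As an aside, the same information can be read programmatically. A minimal sketch, assuming ollama's /api/ps endpoint (the HTTP counterpart of ollama ps) and its documented size_vram field:

```python
import requests

# List the models currently loaded by ollama and their VRAM footprint,
# i.e. the numbers shown in the last column of Table 1.
resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()

for model in resp.json().get("models", []):
    vram_gib = model["size_vram"] / 2**30  # bytes of the model resident in VRAM
    print(f'{model["name"]}: {vram_gib:.1f} GiB in VRAM')
```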

Tested Systems

System | VRAM | Architecture | Chip
2x Nvidia 5090 | 64GB | Blackwell | GB202
Nvidia 5090 | 32GB | Blackwell | GB202
Nvidia 4090 | 24GB | Ada Lovelace | AD102
Apple M4 Max | 64GB | M4 | 40-core M4 Max
Apple M1 Pro | 16GB | M1 | 16-core M1 Pro
Nvidia 4070 Mobile | 8GB | Ada Lovelace | AD106
Nvidia 2080 Ti | 12GB | Turing | TU102
2x Nvidia P40 | 48GB | Pascal | GP102
Nvidia 3060 Mobile | 6GB | Ampere | GA106
Table 2: Tested systems

Table 2 shows the test systems available to us, including their respective architectures and VRAM sizes. The Apple systems as well as the 4070 and 3060 systems are laptops; the rest are desktop PCs.

Results

We test the two LLM families, Qwen3 and Gemma3, in quantized and fp16 versions for their performance on the consumer hardware listed in Table 2.

Qwen3

Here we test Qwen3's performance across our test systems, first in the quantized form and then in the full precision available in the ollama repository.

Chart 1: Qwen3 quantized performance on consumer hardware

Chart 1 shows the performance data for Qwen3 running in Q4-quantized form. The 14b variant already runs on a wide variety of systems, while the 32b model requires more than 24GB of VRAM; the 4090 is therefore unable to run it purely in VRAM. While the M4 is able to run the model thanks to its unified memory, the performance is low at only 15 t/s (tokens per second). The dual P40 setup manages only 7 t/s, but is of course a much cheaper setup.
Running a model across two consumer-grade GPUs comes with a significant performance penalty. The dual 5090 setup is even outperformed by the 4090 on the smallest Qwen3 model.

Chart 2: Qwen3 fp16 performance on consumer hardware

When switching to the 16-bit precision of Qwen3, no setup is able to run the 32b variant anymore, as it takes 70GB of VRAM. Chart 2 shows the remaining possible combinations. The performance of the 14b variant is now roughly on the level of the quantized 32b model: the 5090 reaches 52 t/s, the P40s 9 t/s and the M4 13 t/s.
The dual 5090 setup is again faster than the 4090 across the board.

Gemma3

Gemma3 is intended to run on a single GPU, but VRAM restrictions of course still apply. Chart 3 shows the possible combinations for the quantized version.

Chart 3: Gemma3 quantized performance on consumer hardware

The results are in line with our findings for previous LLMs. Again the dual 5090 setup experiences a performance penalty compared to the single 5090.

Chart 4: Gemma3 quantization aware performance on consumer hardware

Gemma3 also comes in quantization-aware trained (QAT) variants. These model versions are not simply quantized after training from the full-precision weights; instead, quantization is already simulated during training. Google states: “QAT simulates low-precision operations during training to allow quantization with less degradation afterwards for smaller, faster models while maintaining accuracy.”
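To make the idea concrete, here is a small conceptual sketch of the fake-quantization step that QAT-style training inserts into the forward pass. This is our own illustration, not Google's training code, and it omits details such as per-channel scales and the straight-through estimator used for gradients:

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Quantize and immediately dequantize a weight tensor.

    The network trains against the rounded values, so it learns to
    tolerate the precision loss it will see after real quantization.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit symmetric quantization
    scale = max(float(np.abs(w).max()), 1e-8) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                # back to float, rounding error included

w = np.random.randn(4, 4).astype(np.float32)
w_simulated = fake_quantize(w)                      # used in the forward pass during QAT-style training
```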

Chart 4 shows the performance of the quantization-aware versions. The VRAM and performance impact appears to be low. We cannot speak to the quality of the generated results, but any quality gains here would come very cheaply.

Chart 5: Gemma3 fp16 performance on consumer hardware

Even the largest Gemma3 full-precision variant runs on some of our test setups, albeit pushing the VRAM limit at 63GB. There is also a caveat with the 27b variant that we discuss in the following section. Chart 5 shows the performance data.
Notably, the M4 can just about run the model but only achieves 3 t/s.

The Gemma3 27b Memory Issue

When running gemma3:27b-it-fp16 on the dual 5090 and the M4 setups, ollama consistently chooses to run one layer on the CPU. We were not able to force it to use the GPU only. We count the results anyway, as the performance impact is apparently minimal according to the ollama community. We have nevertheless opened a bug ticket for the issue: https://github.com/ollama/ollama/issues/11162
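Whether a loaded model is fully offloaded to the GPU can be checked with the same /api/ps data used above; a model whose size_vram is smaller than its total size has layers running on the CPU. A hedged sketch:

```python
import requests

# Flag models that ollama has only partially offloaded to the GPU,
# as we observed with gemma3:27b-it-fp16.
resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()

for model in resp.json().get("models", []):
    off_gpu = model["size"] - model["size_vram"]   # bytes kept in system RAM instead of VRAM
    if off_gpu > 0:
        print(f'{model["name"]}: {off_gpu / 2**30:.1f} GiB not in VRAM (partial CPU offload)')
```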

Conclusion

Like in our previous tests with Llama 3 and DeepSeek-R1, consumer hardware is capable of running at least the smaller or quantized variants of current-generation LLMs with satisfactory performance. If Google’s claims about the output quality of the quantization-aware model variants hold true, these will be a very interesting option for local use.
Again, older server hardware like the Nvidia P40 or Quadro P6000 may give you access to the largest model variants at a lower price point than current-generation GPUs.