Qwen3 and Gemma3 Performance on Consumer Hardware
Large Language Models (LLMs), often simply called AI, are widely adopted by now. Most of us use them daily for work, leisure, and private matters. That also means we share a lot of potentially sensitive information with the LLM. This can be problematic when using third-party services like ChatGPT, especially now that OpenAI has to retain all chat logs indefinitely. So how about running an LLM on your own hardware?
We previously measured two popular current-generation models, Llama 3 and DeepSeek-R1. This time we follow up with Qwen3 and Gemma3, both released in recent months. We want to give you an idea of which kind of model will run on what hardware and what performance you can expect.
Test Setup
We test two of the currently popular LLM families, Qwen3 and Gemma3, in quantized and 16-bit floating-point (fp16) versions to measure their performance on consumer hardware.
We run the models using ollama 0.9.2 with Open WebUI as the frontend. We collect the measured response tokens after executing the following query:
I need a summary of the book "War and Peace". Please write at least 500 words.
For each test, the model fits into the memory of the graphics card (VRAM). Mobile devices are allowed to cool down before each test run.
For Qwen3, the thinking mode was left at default, which is “on”.
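For readers who want to reproduce the measurement without Open WebUI, the same numbers can be read directly from ollama's REST API: the `/api/generate` endpoint returns `eval_count` (generated tokens) and `eval_duration` (generation time in nanoseconds) with each non-streaming response. The following Python sketch is our own minimal illustration of that approach, not the exact harness used for the results in this article.

```python
import requests  # assumes the `requests` package is installed

PROMPT = 'I need a summary of the book "War and Peace". Please write at least 500 words.'

def measure_tokens_per_second(model: str, host: str = "http://localhost:11434") -> float:
    """Run one generation against a local ollama instance and return the response token rate."""
    resp = requests.post(
        f"{host}/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = number of generated tokens, eval_duration = generation time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

if __name__ == "__main__":
    print(f"{measure_tokens_per_second('qwen3:4b'):.1f} tokens/s")
```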
Tested Models
In this article we focus on two of the currently popular LLMs:
Qwen3, developed by Alibaba, is optimized for deployment on Alibaba Cloud infrastructure, leveraging GPUs and specialized AI accelerators. It excels in advanced conversational AI, content generation, multilingual interactions, and complex reasoning tasks. Typical applications include customer support automation, virtual assistants, translation, summarization, and sentiment analysis. Its scalable architecture ensures efficient integration into cloud-based business solutions, particularly within Alibaba’s ecosystem.
Gemma3, created by Google DeepMind, is designed for versatility and efficiency across diverse hardware platforms, including consumer-grade GPUs, CPUs, and edge devices. It effectively handles tasks such as text generation, summarization, conversational AI, and question-answering. Its lightweight, open-source architecture makes it ideal for resource-constrained environments, enabling applications like personal assistants, educational tools, and interactive chatbots. Gemma3’s open-source nature encourages customization, experimentation, and broad adoption in both research and industry contexts.
Model | Variant | Precision | VRAM Size |
---|---|---|---|
qwen3:0.6b | | Q4_K_M | 2.2GB |
qwen3:4b | | Q4_K_M | 5.2GB |
qwen3:8b | | Q4_K_M | 7.5GB |
qwen3:14b | | Q4_K_M | 12GB |
qwen3:32b | | Q4_K_M | 25GB |
qwen3:0.6b | | fp16 | 3GB |
qwen3:1.7b | | fp16 | 5.3GB |
qwen3:4b | | fp16 | 10GB |
qwen3:8b | | fp16 | 18GB |
qwen3:14b | | fp16 | 32GB |
gemma3:1b | instruct | Q4_K_M | 1.9GB |
gemma3:4b | instruct | Q4_K_M | 6GB |
gemma3:12b | instruct | Q4_K_M | 11GB |
gemma3:27b | instruct | Q4_K_M | 21GB |
gemma3:1b | instruct, quantization aware | Q4_K_M | 2.1GB |
gemma3:4b | instruct, quantization aware | Q4_K_M | 6.6GB |
gemma3:12b | instruct, quantization aware | Q4_K_M | 12GB |
gemma3:27b | instruct, quantization aware | Q4_K_M | 22GB |
gemma3:1b | instruct | fp16 | 3.1GB |
gemma3:4b | instruct | fp16 | 11GB |
gemma3:12b | instruct | fp16 | 31GB |
gemma3:27b | instruct | fp16 | 63GB |
Table 1 shows the tested LLMs. Ollama uses quantized models by default; if you specify a more detailed tag, you can choose different quantizations as well as the non-quantized floating-point versions. We select Qwen3 and Gemma3 for their popularity and start with the smallest variants parameter-wise, then successively move to models with more parameters until they would no longer fit into the VRAM of any of the test systems.
The last column in the table shows the size of the model when loaded into VRAM, as reported by `ollama ps`. We can already see here that not all models will fit into the VRAM of the test systems from Table 2.
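As a rough plausibility check on the sizes in Table 1, the weight footprint of a model can be estimated from its parameter count and the bits spent per weight: 16 bits per parameter for fp16 and roughly 4 to 5 bits for Q4_K_M. The sketch below is a back-of-the-envelope estimate under these assumptions; it deliberately ignores the KV cache and runtime buffers, which is why the values reported by `ollama ps` in Table 1 come out noticeably larger.

```python
def estimate_weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only memory estimate in GB (ignores KV cache and runtime overhead)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Assumed bit widths: 16 for fp16, ~4.5 for Q4_K_M (a mixed 4/6-bit quantization scheme).
for params, label in [(4, "qwen3:4b"), (8, "qwen3:8b"), (14, "qwen3:14b")]:
    fp16 = estimate_weight_footprint_gb(params, 16)
    q4 = estimate_weight_footprint_gb(params, 4.5)
    print(f"{label}: ~{fp16:.0f} GB fp16, ~{q4:.1f} GB Q4_K_M (weights only)")
```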